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ABSTRACT 

This  thesis  describes  the  development  and  application  of  a  new  tool  for  profiling 
marine  microbial  communities.  Chapter  1  places  the  tool  in  the  context  of  the 
range  of  methods  used  currently.  Chapter  2  describes  the  development  and 
validation  of  the  “genome  proxy"  microarray,  which  targeted  marine  microbial 
genomes  and  genome  fragments  using  sets  of  70-mer  oligonucleotide  probes.  In 
a  natural  community  background,  array  signal  was  highly  linearly  correlated  to 
target  cell  abundance  (R^  of  1.0),  with  a  dynamic  range  from  10-10®  cells/ml. 
Genotypes  with  >~80%  average  nucleotide  identity  to  those  targeted  cross- 
hybridized  to  target  probesets  but  produced  distinct,  diagnostic  patterns  of 
hybridization.  Chapter  3  describes  the  development  an  expanded  array,  targeting 
268  microbial  genotypes,  and  its  use  in  profiling  57  samples  from  Monterey  Bay. 
Comparison  of  array  and  pyrosequence  data  for  three  samples  showed  a  strong 
linear  correlation  between  target  abundance  using  the  two  methods  (R^=0.85- 
0.91 ).  Array  profiles  clustered  into  shallow  versus  deep,  and  the  majority  of 
targets  showed  depth-specific  distributions  consistent  with  previous  observations. 
Although  no  correlation  was  observed  to  oceanographic  season,  bloom 
signatures  were  evident.  Array-based  insights  into  population  structure 
suggested  the  existence  of  ecotypes  among  uncultured  clades.  Chapter  4 
summarizes  the  work  and  discusses  future  directions. 
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Chapter  1 


Community  profiling  methods  in  microbial  ecology. 


9 


Microorganisms  drive  global  biogeochemical  cycles,  and  are  major 
numerical  and  biomass  components  of  every  habitat  on  the  planet.  Microbial 
ecology  seeks  to  understand  the  relationships  between  microbial  communities 
and  their  surrounding  environment.  However,  these  communities  are  complex, 
microscopic,  and  difficult  to  observe,  and  it  remains  an  ongoing  methodological 
and  conceptual  challenge  to  track  individual  taxa  within  communities  and  to 
profile  communities  as  a  whole.  A  range  of  methods  are  available  for  microbial 
community  profiling,  spanning  differing  degrees  of  phylogenetic  resolution, 
specificity,  sensitivity,  ease  of  use,  cost  of  adoption,  and  cost  per  sample.  In  this 
chapter,  I  present  an  overview  of  the  major  profiling  methods  (Figure  1),  to  place 
the  new  profiling  tool  developed  and  used  in  this  thesis  into  context.  This  review 
does  not  address  methods  for  functionally  profiling  communities  or  linking 
specific  taxa  to  ecosystem  functions,  but  rather  focuses  solely  on  methods  used 
to  assess  community  composition.  These  methods  address  questions  about  who 
is  there,  rather  than  what  they  are  doing. 

For  the  purposes  of  this  review,  it  will  be  assumed  that  the  microbial 
community  sample  has  been  prepared  or  obtained  in  a  way  that  primarily 
captures  the  microbial  component  of  the  biota  (since  the  larger  cells  can  swamp 
or  obscure  the  microbial  signal,  for  many  of  the  methods  discussed  here).  For 
example,  marine  samples  are  often  first  passed  through  a  1. 6-2.7  pm  nominal 
pore-size  pre-filter  to  remove  larger  eukaryotic  cells  (along  with  particle-attached 
or  larger  microbes)  and  then  collected  onto  a  0.2um  filter,  allowing  non-particle- 
associated  viruses  to  pass  through.  Soil  or  sediment  samples  can  be  sieved  to 
remove  roots  or  invertebrates,  while  symbiont  communities  are  dissected  away 
from  host  tissues.  Based  on  the  habitat  and  conditions  of  the  sample,  the 
chemical  details  of  subsequent  treatments  are  then  tailored  to  optimize 
efficiency,  e.g.  of  DNA  extraction. 

Once  a  microbial  community  sample  is  obtained,  there  are  four  distinct 
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strategies  for  profiling  the  community  composition  of  the  sample: 

1.  Chemical  fixation  and  whole-cell  characterization:  Cells  remain 
essentially  intact.  Fixation  strategies  vary  based  on  the  cell  types  and 
background  particles  involved.  Once  a  sample  is  chemically  fixed,  community 
members  can  be  analyzed  as  whole  cells  based  upon  their  light  scattering  and 
autofluorescent  emission  properties.  Fixation  can  be  followed  by  fluorescence  in 
situ  hybridization  (FISFI),  flow  cytometry  (FCM),  and/or  fluorescence-activated 
cell  sorting  (FACS).  These  methods  provide  the  most  direct  observations  of 
organisms,  and  will  be  discussed  further  below,  but  can  rarely  discriminate 
beyond  a  handful  of  members  or  cell-types  within  complex  microbial 
assemblages. 

2.  Lipid  Extraction:  A  sample’s  lipids  are  extracted,  subjected  to  specific 
chemical  manipulation,  and  structurally  profiled  using  high-pressure  liquid 
chromatography  (HPLC),  mass  spectroscopy  (MS),  and/or  other  methods.  Lipid 
profiling  has  been  a  central  tool  in  paleomicrobiology  and  in  associated 
paleoclimate  studies  (e.g.  reviewed  in  Summons  et  al.,  1996),  for  extinct 
communities  in  which  there  are  no  longer  intact  cells,  and  whose  DNA  may  be 
absent  or  excessively  degraded.  In  these  cases,  lipid  composition  is  used  to 
make  inferences  about  phylotype  presence.  It  has  also  been  used  to  generate 
overall  profiles  for  extant  communities,  (e.g.  Vestal  and  White,  1989,  Summons 
et  al.,  1996,  Schutter  and  Dick,  2000,  Ritchie  et  al.,  2000,  Hinojosa  et  al.,  2005, 
Moore-Kucera  and  Dick,  2008,  Jimenez  Esquilin  et  al.,  2008),  and  to  examine 
particular  groups  with  characteristic  lipids  (e.g.  annamox  bacteria,  Schmid  et  al., 
2005,  archaea,  Biddle  et  al.,  2006,  Ingalls  et  al.,  2006,  Thiel  et  al.,  2007) 
However,  because  of  ambiguities  in  the  robust  assignment  of  particular  lipid 
types  to  specific  clades  (e.g.  Rashby  et  al.,  2007),  and  because  cellular  lipid 
content  and  composition  can  alter,  like  ribosome  content,  depending  on  cellular 
physiology  (Vestal  and  White,  1989,  Villanueva  et  al.,  2004),  it  is  not  especially 
common  as  a  profiling  technique  for  extant  communities,  although  lipid  profiling 
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combined  with  isotope  characterization  is  used  to  infer  links  between  identity  and 
particular  metabolisms  (e.g.  Ingalls  et  al.,  2006).  Therefore  it  will  not  be  further 
discussed  in  this  review. 

3.  Cultivation.  Cultivation  typically  allows  the  characterization  of  <1%  of 
microbial  taxa,  particularly  in  oligotrophic  environments  (e.g.  Staley  and 
Konopka,  1985),  and  cultivars  are  often  rare  phylotypes  with  physiologies  quite 
different  from  those  characteristic  of  the  community  (the  standard  culturing 
process  often  selects  for  eutrophic  clades).  However,  some  clades  are  amenable 
to  cultivation  and  their  presence  and  diversity  in  a  sample  can  thus  be  partially 
profiled  by  cultivation  (e.g.  marine  Vibrio  spp.,  Thompson  etal.,  2005).  In 
addition,  recent  improvements  in  culturing,  using  conditions  closer  to  those  in 
situ,  have  led  to  a  greater  diversity  of  organisms  becoming  isolated  or  enriched 
(e.g.  Connon  and  Giovannoni,  2002,  Rappe  et  al,  2002,  Cho  and  Giovannoni, 
2004,  Stingl  etal.,  2008).  However,  cultivation  alone  has  yet  to  effectively 
capture  the  complexity  of  a  microbial  community  and  even  in  the  face  of  high- 
throughput  optimizations  remains  a  significant  time  commitment.  Thus,  while 
cultivation  is  critical  for  model-systems-based  study  of  particular  community 
members,  it  has  limited  use  as  a  community  profiling  tool.  Even  for  cultivation- 
amenable  clades,  cultured  representatives  rarely  reflect  in  situ  relative 
abundances,  and  thus  further  community  characterization  studies  are  required  to 
accurately  profile  the  original  sample  (Thompson  et  al.,  2005).  For  these 
reasons,  cultivation  will  not  be  further  described  here  (but  see  recent  review  by 
Giovannoni  and  Stingl,  2007). 

4.  Nucleic  acid  Extraction  and  characterization.  The  genetic  code  of  a 
mixture  of  organisms,  if  it  can  be  uniformly  accessed^,  is  a  powerful  means  to 

'  Optimization  of  DNA  extraction  protocols  to  the  particular  community  examined  is  critical. 
Particle  association,  sample  chemistry  (e.g.  humic  substances  in  soils,  waxes  in  some  plants, 
excessive  mucus  in  marine  invertebrate-associated  communities),  and  cell  surface  properties 
(e.g.,  cell  wall  composition,  spore  coats)  affect  extraction  efficiency  may  bias  community 
representation. 
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survey  a  complex  microbial  community.  Community  DNA  or  rRNA  extracted  from 
a  sample  may  require  amplification  before  further  analysis.  Gene-specific 
amplification  allows  profiling  of  microbial  diversity  via  a  phylogenetically 
informative  gene  of  interest,  often  the  16S  rRNA  gene  or  molecule.  Amplification 
might  be  done  quantitatively  (e.g.,  quantitative  PCR  or  RT-PCR)  to  assess 
relative  abundances  or  non-quantitatively  to  generate  sufficient  copies  of  the 
gene  of  interest  for  further  profiling  analyses.  In  this  latter  case,  amplified  genes 
may  be  “fingerprinted"  by  methods  ranging  from  terminal  restriction  fragment 
length  polymorphism  (tRFLP)  to  microarrays  to  sequencing.  Alternatively, 
community  profiling  can  occur  without  amplification  by  sampling  the  community 
genome  directly  through  sequencing,  either  with  or  without  cloning. 

This  range  of  commonly-used  community  profiling  methods,  from  FISH 
through  metagenomics  (Figure  1),  are  described  briefly  in  the  context  of  profiling 
and  compared  below. 

Fluorescence  in  situ  Hybridization  (FISH) 

Fluorescence  in  situ  hybridization  (FISH)  is  widely  used  in  microbial  ecology 
(for  a  recent  review,  see  Amann  and  Fuchs,  2008)  and  allows  the  enumeration 
and  profiling  of  phylogenetic  subsections  of  a  community  using  fluorescence 
microcopy.  Briefly,  microbial  cells  from  a  natural  community  sample  are 
chemically  fixed  and  permeabilized,  then  hybridized  to  fluorescently-labeled 
oligonucleotide  probes  complementary  to  e.g.  the  16S  rRNA  sequence  of  a  clade 
of  interest.  Fluorescently  labeled  cells  are  then  imaged  and  quantified  using 
fluorescence  microcopy  (see  e.g.  Amann  and  Fuchs,  2008).  Unlike  the  majority 
of  community  analysis  techniques,  FISH  examines  whole  cells  and  thus  provides 
additional  community  information  about  cellular  morphology  and  spatial 
structuring. 

A  functional  physiological  component  can  be  added  to  FISH  by  combining  it 
with  other  methods.  MAR-FISH  (aka  Micro-FISH,  STAR-FISH)  combines  FISH 
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with  microautoradiography  by  incubating  a  sample  with  radiolabeled  substrate 
diagnostic  for  a  metabolic  pathway  or  biogeochemical  process  (Lee  et  al.,  1999). 
Further  refinements  include  FISH-SIMS,  which  uses  secondary-ion  mass 
spectroscopy  to  detect  both  stable  and  radioactive  isotopes,  to  a  spatial 
resolution  of  10-1 5pm  (Orphan  et  al.,  2001 ).  The  resolution  of  such  methods 
continues  to  improve,  with  FISH-NanoSIMS  (50nm)  allowing  a  new  level  of 
precision  in  hypothesis-testing  (Lechene  eta!.,  2007). 

In  spite  of  the  considerable  strengths  of  FISH-related  profiling  methods, 
there  are  limitations:  (1)  variable  fixation  /  membrane  permeabilization  across  cell 
types,  (2)  detection  sensitivity  in  natural  samples,  (3)  variable  probe  specificity 
and  accessibility,  (4)  low  sample  throughput  and  (5)  high  background  and  sample 
autofluorescence.  (For  a  good  methodological  review  of  FISH,  see  Bottari  et  al., 
2006). 

Fixation  /  permeabilization:  Cell  wall  composition  varies  among  microbes, 
and  so  fixation  and  permeabilization  without  lysis  can  be  difficult  to  achieve 
across  the  range  of  cell  types  typically  present  in  a  community,  and  methods 
must  be  tailored  to  each  investigation.  Fixation  and  permeabilization  is  achieved 
with  paraformaldehyde  and  ethanol,  aided  by  enzymes  (lysozymes  and 
proteases),  solvents,  acids,  and  detergents  (Amann  and  Fuchs,  2008).  Not  only 
can  the  efficacy  of  these  steps  vary  greatly  among  cells,  and  thus  affect  target 
accessibility,  but  the  composition  of  co-purified  elements  of  the  community 
habitat  can  also  affect  the  efficiency  of  probe  binding  as  well  as  sensitivity 
(Amann  and  Fuchs,  2008). 

Sensitivity:  Several  factors  affect  FISH  sensitivity.  First,  FISH  probes  target 
ribosomal  RNAs,  which  are  an  active  component  of  the  ribosome,  the  cell’s 
protein  assembly  machinery.  The  number  of  ribosomes  in  any  given  cell  can  vary 
significantly,  from  tens  to  tens  of  thousands  of  copies  per  cell.  Small  cells  tend  to 
have  fewer  ribosomes  per  cell  (Kemp  et  al.,  1993,  Morris  et  al.,  2002),  and  poor 
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growth  conditions  can  result  in  decreased  ribosomal  content  (Kemp  et  al.,  1998). 
The  ability  to  detect  specific  organisms  is  highly  dependent  on  the  ribosome  copy 
number,  and  thus  for  some  environmental  applications  targeting  small  and/or 
nutrient-deprived  cells,  sensitivity  may  be  a  challenge. 

In  addition,  the  probe  and  fluorophore  selection  can  affect  sensitivity.  Not  all 
dyes  are  equally  bright  (e.g.,  Alexa  Fluor  555  outperforms  Cy3),  and  probe 
alterations  can  also  help  amplify  signal.  Multiple  probes  targeting  the  same  taxon 
at  different  regions  of  its  rRNA  (e.g.  for  SAR1 1  -  Morris  et  al.,  2002),  or  multiply- 
labeling  single  probes  (Pernthaler  et  al.,  2002),  can  both  be  effective  strategies. 

In  the  latter  case,  the  accommodation  of  additional  dye  molecules  requires  more 
expensive  polynucleotide  probes  instead  of  oligonucleotides,  which  consequently 
constrains  target  resolution  to  higher-level  taxonomic  levels  (Amann  and  Fuchs, 
2008).  An  additional  option  for  increasing  probe  sensitivity  is  CARD  (CAtalyzed 
Reporter  Deposition)  -FISH,  which  links  the  horseradish  peroxidase  (HP)  enzyme 
to  the  oligo  probe  instead  of  a  fluorophore.  This  enzyme  then  converts 
fluorescently-labeled  tyramide  molecules  in  the  reaction  mixture  into  their 
fluorescently-active  state,  in  which  they  also  bind  locally  to  the  cell,  allowing 
signal  amplification  up  to  20  times  the  intensity  seen  with  monolabel  FISH  probes 
(Schdnhuber  et  al.,  1997,  Pernthaler  et  al.,  2002).  The  enzyme  molecule  is  40-80 
times  larger  than  a  fluorophore,  however,  and  thus  good  entry  of  the  HP-oligo 
conjugate  into  the  cell  can  be  challenging  and  requires  good  permeabilization 
(e.g.  initial  embedding  of  the  cells  in  agarose  followed  by  stronger  chemical 
treatment  than  in  standard  permeabilization  protocols).  The  bulkiness  of  the 
CARD-FISH  probe  is  particularly  problematic  when  evaluating  densely- 
associated  cells  (Schonhuber  et  al.,  1997),  and  in  these  cases  becomes  less 
sensitive  than  classical  FISH.  However,  in  planktonic  cells,  the  increased 
sensitivity  of  CARD-FISH  has  allowed  the  simultaneous  labeling  of  both 
ribosome  and  mRNA  from  particular  genes  of  interest  (Pernthaler  and  Amann, 
2004).  In  this  manner,  one  can  simultaneously  assay  for  “who  is  present”  and 
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“are  they  active”,  within  a  complex  microbial  community.  As  sensitivity  increases, 
FISH  has  moved  from  detecting  high-copy  ribosomal  RNAs  to  mRNAs,  and  has 
achieved  hybridization  and  visualization  to  single-copy  genes  in  the  genome 
(RING-FISH,  Zwirglmaier  et  al.,  2004;  RCA-assisted  FISH,  Smolina  et  al.,  2007, 
Marayuma  etal.,  2005). 

Lastly,  rather  than  increasing  the  absolute  signal,  sensitivity  can  be 
improved  by  reducing  the  noise  and  background  fluorescence.  Sample  treatment 
plays  an  important  role  in  this,  since  humics  and  other  contaminants  can 
autofluoresce  and/or  interfere  with  probe  binding.  In  addition,  some  probe 
structural  design  modifications  can  decrease  background  fluorescence  caused 
by  non-specific  binding  (e.g.  “molecular  beacon”  probes,  which  improve  signal-to- 
noise  ratios  and  thus  sensitivity,  e.g.  Lenaerts  et  al.,  2007). 

Specificity:  As  with  other  DNA-based  hybridization  methods,  specificity  is  an 
important  consideration  in  FISH.  Recent  general  modifications  to  probe  structure 
(e.g.  peptide  nucleic  acid  probes,  described  in  Bottari  et  al.,  2006,  Amann  and 
Fuchs,  2008)  offer  increased  binding  affinity  and/or  stability,  which  can  increase 
specificity  (and  sensitivity).  As  with  all  nucleic-acid  probes,  FISH  probe  specificity 
relies  on  the  size  and  quality  of  the  reference  database  used  to  design  them. 
Many  commonly-used  domain  and  group-specific  probes  were  designed  when 
the  16S  databases  were  significantly  smaller,  and  thus  have  an  expanding  false¬ 
negative  and  -positive  rate  (for  an  updated  analysis  of  common  FISH  probes 
specificity  see  Amann  and  Fuchs,  2008).  Ideally,  probes  should  be  re-evaluated 
before  use  using  current  databases  and  tools  (e.g.  probeBase,  the  online 
database  and  toolkit  for  designing  16S-rRNA  probes  and  primers,  Loy  et  al., 
2007).  Probe  design  and  optimization  are  not  restricted  to  cultivated  clades, 
since  not  only  do  many  clades  have  sufficient  database  representation  due  to 
16S  environmental  surveys,  but  such  environmental  sequences  can  be  cloned 
and  then  used  within  their  heterologous  host  for  optimization  of  hybridization 
conditions  (clone-FISH,  Schramm  et  al.,  2002).  Finally,  in  order  to  add 
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confidence  to  the  interpretation  of  FISH  specificity,  multiple  hierarchic  probes  can 
be  used  simultaneously  as  internal  cross-validation  (Amann  and  Fuchs,  2008). 

From  a  practical  stand-point,  as  a  community  profiling  method  FISH  is 
inexpensive  once  set  up  (i.e.  requires  a  good  fluorescence  microscope  but  not 
exorbitant  reagents),  and  is  ideal  for  research  questions  that  target  a  small 
number  of  monophyletic  groups  with  good  cell  permeability,  particularly  where 
spatial  structure  and  cell  size  observations  are  relevant.  Sample  preparation 
requires  a  moderate  to  advanced  level  of  technical  expertise  in  order  to  obtain 
good  results  (fixation,  permeabilization  and  hybridization  protocols  all  need  to  be 
tuned  for  different  communities),  and  takes  hours  to  days  depending  on  the 
sample.  Once  methods  are  optimized,  FISH’s  limit  of  detection  can  reach  -0.1% 
of  the  community  (Amann  and  Fuchs,  2008,  Woebken  et  al.,  2007).  Although 
sample  scanning  and  analysis  are  becoming  more  automated  (Daims  et  al., 

2006;  Alonso  and  Pernthaler,  2005;  Cottrell  and  Kirchman,  2003),  FISH  is  not  a 
high-throughput  method.  In  summary,  FISH  is  an  important  community  profiling 
tool  when  examining  division-level  community  structure  (using  previously 
developed  probes)  or  focusing  in  on  a  limited  number  of  specific  microbial  groups 
(the  number  of  different  species  simultaneously  resolvable  is  currently  around 
seven,  Amann  and  Fuchs,  2008).  However,  given  the  complexity  of  natural 
microbial  communities  and  time  required  for  probe  development,  FISH  is  not 
meant  to  comprehensively  profile  more  than  a  small  fraction  of  the  microbial 
community  at  a  refined  taxonomic  scale.  Profiling  at  higher  taxonomic  levels  - 
e.g.  the  level  of  alpha-,  beta-,  or  gamma-  proteobacteria  -  can  miss  significant 
differences  at  lower  taxonomic  resolution.  Also,  the  application  of  FISH  cannot 
easily  be  standardized  to  a  variety  of  samples,  due  to  differences  in  fixation  and 
hybridization. 


Flow  cytometry  and  fluorescentiv-activated  cell-sorting 
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Flow  cytometry  (FCM)  and  fluorescence-activated  cell  sorting  (FACS)  use 
optical  properties  of  a  sampled  population  to  enumerate  and/or  separate  different 
optical  grouping  of  cells.  Basic  FCM  relies  solely  upon  the  inherent  properties  of 
the  cells  themselves.  For  example,  FCM  “signatures”  result  from  the  combined 
effects  of  cell  size,  shape,  and  internal  structure  on  light  scatter,  and  of  the 
characteristic  autofluorescence  of  naturally-occurring  cell  pigments  (e.g., 
chlorophyll  or  phycoerythrin).  While  FCM  simply  counts  and  records  cells  based 
on  these  properties,  FACS  uses  the  same  properties  to  sort  different  populations 
of  cells. 

A  major  focus  of  flow  cytometric  studies  has  been  photosynthetic  members 
of  communities  (for  a  recent  discussion  of  phytoplankton  flow  cytometry  see 
Dubelaar  et  a!.,  2007)  because  of  their  ease  of  detection  due  to  pigment 
autofluorescence.  Many  photosynthetic  clades  can  be  distinguished  based  on 
differential  pigment  composition  and  cell  size.  A  notable  example  is  the  FCM- 
signature-based  discrimination  of  the  ocean  cyanobacteria  Proch/orococcus  from 
its  co-occurring  sister  group,  Synechococcus,  by  photosynthetic  pigments  (divinyl 
chlorophyll  a  vs  chlorophyll  a  and  phycoerythrin)  and  cell  size,  and  from  larger 
co-occurring  picoeukaryotic  phytoplankton  (Chisholm  eta/.,  1988,  Waterbury  et 
at.  1984).  Even  without  the  additional  probe-  and  dye-based  discriminatory 
abilities  of  flow  cytometry,  its  suitability  for  profiling  photosynthetic  microbes 
ensures  its  value  as  a  tool  particularly  for  aquatic  microbial  ecologists,  and  it  has 
been  used  widely  and  successfully  in  a  number  of  studies  (e.g.  Seymour  et  at., 
2005,  Johnson  and  Zinser  eta!.,  2006,  Mary  et  at.,  2006,  Thyssen  et  at.,  2008). 

FCM  profiling  of  non-photosynthetic  groups  is  more  challenging  because 
they  possess  less  native  FCM-signature  information.  Without  fluorescence,  small 
cells  are  difficult  to  image  based  solely  on  light  scatter  unless  significant 
instrument  modifications  are  made  for  small-particle  detection.  As  a  result, 
methods  developed  for  epifluorescent  microscopy  have  been  transferred  to  FCM 
applications,  for  example  DNA  stains  (e.g.,  Hoescht,  DAPI,  SYBR)  may  be  used 


18 


to  intercalate  with  cellular  DNA  and  cause  cells  to  fluoresce  (when  excited  at  the 
correct  wavelength).  Such  bulk  DNA  staining  can  allow  total  microbial  counts 
(e.g.  in  Kuypers  et  a!.,  2005),  the  delineation  of  gross  miaobial  groups  (e.g.,  high 
and  low-DNA  cells,  Gasol  et  al.,  1999),  and  the  examination  of  cell  physiology 
(e.g.  LIVE/DEAD  stains,  Berney  et  al.,  2007),  it  does  not  allow  much  resolution 
for  community  profiling. 

Rather  than  bulk  staining,  clade-specific  FISH  tagging  may  be  combined 
with  FCM  to  enumerate  particular  phylogenetic  groups  from  a  community.  FISH- 
FACS  has  begun  to  allow  the  enrichment  of  target  populations  from  complex 
communities  for  further  analysis.  Although  absolute  separation  of  probe-targeted 
cell  types  has  not  yet  been  obtained,  enrichment  for  targeted  cells  has  been 
successful  (e.g.  type  I  and  II  methanotrophs  enriched  from  4.7%  to  50%  and 
1 .2%  to  47.5%,  respectively,  Kalyuzhnaya  et  al.,  2006;  but  only  ~2-fold  for 
targeted  CFB  and  no  enrichment  for  targeted  3-proteobacteria  in  Sekar  et  al., 
2004).  Combining  FCM  with  FISH  involves  many  of  the  limitations  and 
challenges  associated  with  FISH  (especially  the  negative  effects  of  contaminants 
not  removed  during  sample  preparation),  while  removing  the  spatial  structural 
observational  power  of  FISH,  but  it  does  allow  the  enumeration  of  the  whole  cells 
of  targeted  groups  in  a  high-throughput  manner. 

In  addition,  advances  in  FCM,  such  as  equipment  miniaturization,  are 
bringing  down  costs  and  enabling  novel  field  deployments  (Gruden  et  al.,  2004, 
Yang  et  al.,  2006).  Perhaps  pinnacle  among  these  field  efforts  are  the 
“FlowCytoBot"  (Olson  and  Sosik,  2007;  Sosik  and  Olson,  2007)  and  “Cytosub” 
(Thyssen  et  al.,  2008)  which  use  robotics  and  autonomous  sampling  devices  to 
enable  in  situ  real-time  FCM.  The  FlowCytoBot  can  be  deployed  for  more  than  a 
month,  and  performs  both  in  situ  flow  cytometry  and  automated  image-based 
taxonomic  identification  of  larger  cells. 

Overall,  as  a  profiling  method,  FCM  and  FACS  are  relatively  quick. 
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reproducible,  and  inexpensive  (once  a  flow  cytometer/sorter  is  available; 
although  most  core  facilities  have  them  available  because  of  their  use  in  medical 
research).  For  community  profiling,  however,  the  approach  has  not  been  well 
developed  for  standard  and  comprehensive  surveys.  Depending  on  the 
population  being  targeted  and  the  identification  method,  FCM  can  be  a  relatively 
straightforward  process  or  a  technologically  very  sophisticated  one  that  requires 
substantial  expertise  and  validation. 

Nucleic  acid-based  profiling 

Since  Pace  et  al.  (1985)  pioneered  the  use  of  ribosomal  rRNA  gene  as  a 
genetic  marker  for  studying  the  diversity  of  microbes  in  natural  systems,  this  new 
window  has  provided  unprecedented  views  of  the  “uncultured  majority”  and 
vastly  expanded  our  understanding  of  microbial  diversity.  The  remainder  of  this 
chapter  is  devoted  to  reviewing  the  major  culture-independent  molecular 
methods  used  to  profile  microbial  communities.  These  methods  fall  into  two 
fundamental  classes:  those  that  rely  upon  amplifying  a  single  target  gene  from 
community  DNA  or  rRNA,  and  use  this  gene  to  “fingerprint”  the  community 
through  any  of  a  number  of  methods,  and  those  that  directly  examine  the 
community  DNA  without  gene-specific  amplification  (i.e.  metagenomics). 

Single-gene  surveys 

One  way  to  profile  a  community  is  by  surveying  a  phylogenetically- 
informative  marker  gene  within  that  community.  Commonly-used  phylogenetic 
markers  include  genes  involved  in  translation  (16S  rRNA,  23S  rRNA,  the  internal 
transcribed  spacer  (ITS)  region  between  the  two,  and  ribosomal  proteins), 
transcription,  (e.g.  transcription  factors,  and  RNA  polymerases  component  genes 
such  as  rpoB),  and  DNA  replication  and  repair  (DNA  pol,  recA),  and  other  core 
cellular  functions  (e.g.  the  chaperone  gene  dnaK),  as  well  as  some  functional 
genes  that  are  considered  phylogenetically  robust  (i.e.,  show  little  or  no  evidence 
of  horizontal  gene  transfer),  e.g.  the  pmoA  gene  of  methanotrophs  (McDonald  et 
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ai,  2008). 


Caveats  of  DNA  amplification 

Single-gene  investigations  of  community  profiles  use  the  polymerase 
chain  reaction  (PCR)  to  amplify  the  target  gene  from  environmental  DNA. 
However,  this  amplification  has  several  caveats.  First,  PCR  reactions  can  create 
both  errors  (heteroduplex  and  chimeric  products,  as  well  as  polymerase  error) 
and  biases  (skewing  of  the  relative  proportions  of  sequence  types).  Both 
significantly  effect  downstream  diversity  estimates  (Acinas  et  ai,  2005, 

Thompson  et  ai,  2002,  Polz  and  Cavanaugh  1998).  Second,  given  the 
complexity  of  unknown  microbial  communities  in  the  wild,  primer  specificity  (as 
described  above)  may  not  be  uncertain,  i.e.  primer  sets  may  not  amplify  as 
comprehensively  as  they’re  assumed  to.  For  example,  “universar-primer-based 
surveys  missed  an  entire  high-level  taxonomic  group,  the  kingdom 
Nanoarchaeota  (Huber  et  ai,  2002).  Similarly,  other  primer  sets  continue  to  miss 
lineages  (e.g.  among  the  archaea,  Teske  and  Sorensen,  2007,  and  among  the 
planctomycetes,  Derakshani  etai,  2001,  Kohler  et  ai,  2008). 

Several  methodological  adjustments  have  been  suggested  to  ameliorate 
the  problems  of  specific-primer-directed  PCR.  First,  the  number  of  amplification 
cycles  should  be  kept  to  a  minimum  to  decrease  bias  (Suzuki  and  Giovannoni, 
1996)  and  to  minimize  chimera  formation  and  polymerase  errors  (Acinas  et  ai, 
2005).  Second,  pooling  replicate  PCR  amplification  reactions  helps  compensate 
for  early-cycle  drift  that  leads  to  bias  (Acinas  et  ai,  2005,  Polz  and  Cavanaugh, 
1998).  Third,  reactions  should  be  ramped  as  quickly  as  possible  from  denaturing 
to  annealing  temperatures  (Acinas  et  ai,  2005).  Fourth,  a  “reconditioning” 
approach  minimizes  heteroduplex  and  chimera  formation,  by  employing  a  second 
low-cycle  amplification  using  a  dilution  of  the  first  amplification  with  excess  primer 
(Thompson  et  ai,  2002).  Finally,  as  discussed  above  for  FISH  probes,  it  is 
important  that  primers  (particularly  at  the  domain  level)  be  continuously 
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reassessed  in  light  of  the  ever-expanding  size  of  environmental  sequence 
databases,  and  several  combinations  of  primers  may  be  used  if  the  goal  is  to 
maximize  the  diversity  of  sequences  recovered 

Once  amplified,  phylogenetic  marker  gene  amplicons  can  be  used  for 
community  fingerprinting  methods  or  sequenced. 

Quantification  during  amplification 

Quantitative  PCR  (qPCR)  is  widely  used  in  microbial  ecology  as  a  means  of 
directly  enumerating  the  abundance  of  specific  clades  (via  conserved  sequences 
common  to  all  clade  members)  in  environmental  samples.  To  date,  many  of 
these  studies  have  examined  specific  food  contaminants  and  pathogens,  but 
qPCR  has  played  a  role  in  microbial  ecology  as  well  (reviewed  in  Zhang  and 
Fang,  2006).  Three  studies  of  particular  note  apply  such  methods  to  ocean 
systems,  and  provide  examples  of  the  range  of  target  genes  used.  First,  Suzuki 
et  at.  (2000)  was  one  of  the  first  marine  qPCR  studies  of  the  marine 
picoplankton,  and  in  surveys  of  Prochlorococcus,  Synechococcus,  and  Archaea 
in  Monterey  Bay  showed  good  concordance  between  16S-targeted  qPCR  and 
other  methods.  Second,  Johnson  and  Zinser  et  al.  (2006)  targeted  the  ITS  region 
by  qPCR  to  describe  Prochlorococcus  ecotype  abundance  across  a  basin-wide 
transect  in  the  Atlantic  Qcean.  Remarkably,  the  qPCR  results  accounted  for  most 
of  the  FCM-detected  Prochlorococcus  populations  in  these  ocean  samples.  A 
third  study  used  qPCR  of  a  functional  marker  gene  (amoA)  to  quantify  the 
relative  abundance  of  putative  archaeal  and  bacterial  nitrifiers  in  Monterey  Bay 
and  the  North  Pacific  Subtropical  Gyre  (Mincer  et  al.  2007).  qPCR  is  highly 
reproducible  and  sensitive,  relatively  inexpensive,  high  throughput,  and  is  a 
valuable  tool  for  profiling  particular  community  members.  Significant  primer 
optimization  is  required,  and  robust  primer  design  relies  on  comprehensive 
environmental  sequence  information  to  ensure  that  the  breadth  of  desired  native 
diversity  is  targeted.  In  addition,  single  to  several  taxa  can  be  profiled  using 
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qPCR,  but  because  of  design  and  optimization  issues  and  finite  amounts  of 
sample  DNA  this  method  is  not  practical  for  profiling  many  clades  or  taxa 
simultaneously. 

Fingerprinting  of  amplified  genes 

Most  profiling  studies  examine  amplified  phylogenetic  marker  gene  using 
one  or  more  fingerprinting  method.  These  methods  may  be  used  as  an  end  in 
themselves  or  as  the  first  or  complementary  step  in  an  investigation. 
Fingerprinting  methods  separate  amplicons  based  on  sequence  differences  by 
assaying  some  consequence  of  those  sequence  differences  -  e.g.,  restriction 
sites,  denaturation,  or  amplicon  length.  In  general,  these  methods  are  useful  for 
detecting  structural  changes  in  community  composition  between  samples. 
Depending  on  the  method  and  primer  sets,  these  techniques  may  be  executed 
along  a  scale  from  coarse  to  fine  phylogenetic  resolution.  While  inexpensive  and 
relatively  quick,  these  methods  require  substantial  development  and  validation  to 
connect  particular  fingerprints  with  particular  clades.  Furthermore,  the  dynamic 
range  is  often  limited;  beyond  a  narrow  range,  measurement  of  abundance 
differences  between  samples  may  be  qualitative  rather  than  quantitative.  Many  of 
the  methods  described  below  are  applied  to  a  variety  of  genes,  although  some 
are  specific  to  the  16S  rRNA  gene  or  operon,  and  for  profiling  purposes  the  16S 
gene  remains  the  most  common  target. 

Fingerprinting  by  differences  in  denaturation  and  annealing 

DGGE  &  TGGE:  Denaturing  or  temperature  gradient  gel  electrophoresis. 
DGGE  &  TGGE  (reviews  in  Muyzer  and  Smalla,  1998,  and  Mocker  et  al.,  2007, 
respectively)  separate  small  (generally  less  than  800  base  pairs  (bp))  PCR 
products  by  dissociation  differences  caused  by  sequence  heterogeneity. 
Amplicons  are  electrophoretically  separated  on  an  acrylamide  gel  that  contains  a 
parallel  denaturing  gradient,  generated  either  chemically  (DGGE  e.g.  by  urea- 
formaldehyde)  or  with  temperature  (TGGE).  As  amplicons  move  through  the  gel 
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towards  the  cathode,  they  enter  increasingly  denaturing  conditions,  causing 
bubbles  of  denatured  DNA  to  form  in  the  double-stranded  amplicons.  This  greatly 
increases  the  molecules’  surface  area,  which  significantly  retards  their  movement 
through  the  gel.  Sequence  heterogeneity  among  the  amplicons  results  in 
difference  points  of  denaturation,  and  thus  different  patterns  of  migration.  Some 
DGGE  &  TGGE  protocols  incorporate  a  GC-clamp  (usually  ~  40bp)  on  one 
primer  in  the  PCR  step,  which  will  then  act  to  hold  a  mostly-denatured  amplicon 
together  at  one  end  as  it  moves  through  the  gel  (since  G-C  bonds  are  more 
stable  than  A-T  ones,  so  a  long  G-C  string  will  be  slow  to  denature)  and  thus 
prevent  or  delay  complete  dissociation,  which  would  complicate  banding  and 
interpretation.  However,  GC  clamps  can  cause  decreased  PCR  efficiency  and 
increased  likelihood  of  artifacts,  and  are  not  necessary  if  gentler  denaturing 
gradients  are  used  (Nocker  et  al.,  2007). 

Optimal  denaturing  conditions  are  first  identified  by  running  amplicons  in 
multiple  wells  through  a  denaturing  gradient  perpendicular  to  electrophoresis, 
and  then  the  optimized  range  can  be  used  parallel  to  electrophoresis  to  generate 
the  DGGE  banding.  Bands  may  be  visualized  by  staining  with  DNA  dyes,  or 
primers  can  be  fluorescently  label  to  improve  visualization  sensitivity  of 
denatured  amplicons  (e.g.  10-fold,  Moeseneder  et  al.,  1999).  As  with  all 
fingerprinting  methods,  the  phylogenetic  specificity  of  DGGE  and  TGGE  depends 
on  the  specificity  of  the  primers  used  for  amplification,  although  the  methods  are 
inherently  limited  in  resolution  due  to  their  gel  separation  and  visualization  steps. 
They  are  generally  used  to  identify  bulk  shifts  in  community  composition, 
although  they  can  be  used  effectively  to  track  specific  changes  in  simpler 
communities  (Nocker  et  al.,  2007),  and  there  is  continual  refinement  of  group- 
specific  DGGE  primer  sets  (e.g.  updated  marine  bacterial  clade  primers,  Muhling 
et  al.,  2008).  Under  typical  conditions,  the  limit  of  detection  of  DGGE  and  TGGE 
is  target  groups  that  represent  at  least  -1-2%  of  the  community  (Nocker  et  al., 
2007).  The  method  is  inherently  quite  sensitive  to  sequence  variations  among 
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amplicons,  and  bands  can  be  excised  and  sequenced  to  link  bands  to  particular 
genotypes.  However,  as  with  many  fingerprinting  methods,  one  band  or 
fingerprint  type  cannot  be  assumed  to  derive  from  a  single  phylogenetically 
coherent  sequence  clade.  In  addition,  tailoring  denaturing  conditions  to  the 
particular  amplicons  under  study  can  be  time-consuming,  and  it  can  be  difficult  to 
compare  data  robustly  between  runs  and  labs  because  of  gel  variability;  though 
see  “Additional  Considerations”  below. 

DHPLC:  Denaturing  high-performance  liquid  chromatography.  PCR 
products  are  separated  using  temperature  and  chemical  denaturation.  Products 
loaded  onto  an  HPLC-cartridge  in  a  solution  of  acetonitrile  and  triethylammonium 
acetate,  TEAA.  The  TEAA  converts  to  TEA+,  an  amphiphilic  molecule,  which 
interacts  with  the  DNA  at  its  charged  end,  and  the  HPLC  cartridge  material  at  its 
nonpolar  end,  thereby  tethering  the  amplicons  in  place.  The  amplicons  are 
sequentially  eluted  from  the  cartridge  by  increasing  the  temperature  and 
acetonitrile  concentration.  Elutants  are  quantified  by  a  UV  detector  that  measures 
absorbance  at  260nm,  or  amplicons  can  be  fluorescently  labeled  via  their  PCR 
primers.  A  fraction  collector  can  be  joined  to  the  HPLC  to  collect  eluted 
fragments  for  further  characterization.  DHPLC  does  best  at  separating  smaller 
fragments  below  about  500bp,  but  can  work  on  larger  e.g.  1500bp  molecules  at 
decreased  sensitivity.  Separation  parameters  need  to  be  carefully  tailored  for  the 
targets  of  interest,  and  this  can  be  extremely  time-consuming. 

CDCE;  Constant  denaturant  capillary  electrophoresis.  CDCE  was 
developed  for,  and  remains  primarily  used  for,  genetic  screening  of  mutations 
(Khrapko  et  al.,  1994),  but  has  been  applied  to  community  profiling  in  microbial 
ecology  (Thompson  et  al.,  2004).  PCR  products  are  loaded  onto  a 
polyacrylamide  capillary  with  constantly  denaturing  conditions  (chemical-  or 
temperature-based).  Amplicons  denature  dynamically  to  differing  degrees  and 
rates  based  on  their  sequences,  causing  them  to  travel  at  different  speeds,  and 
elute  at  different  times  from  the  capillary.  By  tagging  PCR  primers  with  a 
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fluorophore,  product  elution  can  be  measured  by  laser  detection.  CDCE  has  high 
sensitivity  to  single  base  pair  differences,  and  there  is  a  quantitative  relationship 
between  fluorescence  intensity  of  eluted  fragments  and  their  relative  abundance. 
Eluted  amplicons  can  be  collected  for  further  analysis. 

SSCP:  Single  strand  conformation  polymorphism.  PCR  amplicons  are 
uniformly  denatured  first,  and  then  are  run  on  an  acrylamide  gel  or  on  a  capillary 
sequencer  as  single-stranded  molecules.  The  ssDNA  molecules  folds  back  on 
themselves  creating  internal  secondary  structure.  This  secondary  structure  will 
result  in  differential  migration  through  a  matrix,  and  allow  for  separation.  A 
challenge  of  SSCP  is  the  difficulty  of  keeping  ssDNA  from  re-annealing  to  its 
complement  during  gel  loading  and  running.  However,  one  of  the  two  primers 
can  be  tagged  with  a  5’  phosphate  group,  which  allows  selective  targeting  of  one 
of  the  two  strands  by  lambda  exonuclease  (Mocker  et  al.,  2007;  Schwieger  and 
Tebbe,  1998).  Interpretation  of  results  can  be  complicated  by  the  fact  that  single 
ssDNA  sequences  can  fold  in  multiple  ways,  which  if  they  have  similar  energetic 
favorability  can  result  in  a  single  sequence  type  being  represented  by  several 
bands. 

Fingerprinting  by  amplicon  restriction  site  heterogeneity 

ARDRA:  Amplified  ribosomal  DMA  restriction  analysis.  The  16S  rRNA 
genes  are  PCR-amplified  from  bulk  community  DMA,  and  cloned.  Clones  are 
then  restriction  digested  and  the  fragments  are  separated  on  a  gel,  visualized  by 
gel  staining  with  a  DNA  dye  (e.g.  ethidium  bromide),  and  the  banding  pattern  is 
interpreted  as  a  low-resolution  proxy  for  phylogeny.  Multiple  restriction  enzymes 
are  required  in  order  to  differentiate  among  lineages  (Moyer  et  at.,  1996),  and 
enzyme  choice  has  a  significant  effect  on  the  scale  of  resolution  (Zeng  et  at., 
2007).  Generally,  clones  are  also  sequenced  to  validate  interpretation  of  banding 
patterns.  Using  higher  taxonomic-level  16S  rRNA  gene  primers  for  amplification 
leads  to  in  low  resolution  sample  comparisons,  while  more  taxonomically-specific 
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primers  can  be  used  to  investigate  particular  clades  of  interest,  as  ARDRA  is 
able  to  discriminate  at  the  species  level  (e.g.  among  bioluminescent  marine 
bacteria,  Kita-Tsukamoto  et  al.,  2006).  ARDRA  is  inexpensive  and  technically 
straightforward,  however  has  a  relatively  low  sensitivity  and  variable  resolution, 
and  as  with  all  gel-image-based  methods,  comparing  profiles  confidently  over 
time  or  between  labs  can  be  challenging.  It  is  used  for  overall  community  profiling 
(e.g.  Polz  et  al.,  1999),  examining  specific  clades  of  interest  within  communities 
(e.g.  Kita-Tsukamoto  etal.,  2006),  or  directing  investigations  of  16S  clone 
libraries  in  order  to  optimize  the  cost-benefit  between  clone  sequencing  and 
adequate  description  of  community  diversity  (e.g.  Sun  et  al.,  2008),  or  for  indirect 
community  profiling  by  distinguishing  among  isolates  (e.g.  Michel  et  al.,  2007). 

TRFLP:  Terminal  restriction  fragment  length  polymorphism.  TRFLP 
(discussed  in  Hartmann  and  Widmer,  2008)  begins  with  gene-specific  PCR 
amplification  using  a  fluorophore-conjugated  primer.  Amplicons  are  restriction 
digested,  and  fragments  are  separated  by  size  using  capillary  electrophoresis, 
and  visualized  in  Genescan  mode.  Restriction  site  distribution  can  be  a 
phylogenetically  informative  character,  and  methods  are  tailored  by  initial  in  silico 
experiments  using  existing  databases  (e.g.  testing  the  specificity  of  a  particular 
primer  set  /restriction  enzyme  combination  against  all  of  RDP,  Marsh  et  al.,  2000, 
Kent  et  al.,  2003).  As  with  ARDRA,  the  use  of  multiple  enzymes  can  help  refine 
resolution  of  interpretation,  and  data  analysis  must  be  done  carefully  (Osborne  et 
al.,  2006).  Caveats  include  incomplete  restriction  digestion,  and  a  slowing  of 
fragment  migration  due  to  unwieldy  dye  molecules  (although  not  all  dye 
molecules  have  a  significant  effect  on  mobility)  (Mocker  et  al.,  2007).  TRFLP  is 
typically  more  sensitive  than  DGGE  (e.g.  five  times  as  sensitive,  Tiedje  et  al., 
1999)  due  to  fluorophore-based  visualization  rather  than  gel  staining,  although  it 
may  be  less  sensitive  that  LH-PCR  (discussed  below)  because  of  partial 
restriction  digestion  (e.g.  in  an  ITS  analysis  using  tRFLP  and  LH-PCR,  LH-PCR 
was  both  more  sensitive  and  higher  resolution,  producing  more  distinct  bands 
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than  tRFLP,  Mills  et  ai,  2003).  TRFLP  analyses  of  a  number  of  genes  are 
common  in  microbial  ecology  studies  (e.g.  Bertilsson  et  al.,  2007,  Goffredi  et  al., 
2008,  Morris  et  al.,  2005,  and  Horz  et  al.,  2005,  Appendix  3)  because  of  the 
method’s  relative  ease  and  high  reproducibility. 

Fingerprinting  by  amplicon  length 

LH-PCR:  Length-heterogeneity-PCR.  LH-PCR  (reviewed  in  Mills  et  al., 

2007)  examines  a  subsection  of  the  16S  rRNA  gene,  spanning  some  subset  of 
its  variable  regions,  and  amplicons  (uncloned)  are  then  distinguished  based  on 
length  heterogeneity.  Since  relatively  small  amplicons  are  created  (generally 
several  hundred  base  pairs),  small  differences  in  length  can  be  resolved.  One 
primer  used  for  amplification  is  fluorescently  linked,  to  allow  relatively  precise 
size  assessment  of  amplicons  using  capillary  sequencers  in  Genescan  mode. 

The  area  under  the  Genescan  peak  is  used  as  a  metric  for  the  abundance  of  a 
particular  fragment  length  class.  LH-PCR  has  been  used  in  a  number  of 
community  profiling  studies  (e.g.  Suzuki  et  al.,  1998,  Ritchie  et  al.,  2000,  Mills  et 
al.,  2003,  Brusetti  et  al.,  2006,  Sekar  et  al.,  2006).  As  with  ARDRA,  primers  can 
be  taxonomically  tuned  to  allow  the  fingerprinting  method  to  focus  on  particular 
clades  (e.g.  LH-PCR  targeted  to  bovine  gut  commensals,  Bernhard  and  Field 
2000).  LH-PCR  can  also  be  used  on  intergenic  regions  of  other  operons,  and  has 
been  for  the  amoC-amoA  operon  (Norton  et  al.,  2002). 

ARISA:  Automated  ribosomal  intergenic  spacer  analysis.  Rather  than 
fingerprinting  the  16S  rRNA  gene,  ARISA  uses  the  amplified,  uncloned  intergenic 
transcribed  spacer  (ITS)  region  between  the  16S  and  23S  rRNA  genes  in  the  rrn 
operon.  The  ITS  can  be  amplified  from  across  broad  taxonomic  range  by  using 
conserved  primers  within  each  of  the  highly  conserved  flanking  gene,  though 
primers  can  be  tailored  to  specifically  target  clades  of  interest.  The  ITS  evolves  at 
a  faster  rate  than  the  rRNA  genes,  providing  a  finer  scale  of  phylogenetic 
resolution  and  allowing  discrimination  up  to  -98%  rRNA  sequence  identity 
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(Brown  et  al.,  2005).  Initially,  ARISA  amplicons  were  separated  on 
polyacrylamide  gels  and  visualized  with  silver  staining,  but  later  incorporated 
fluorescently-labeled  primer-based  visualization  using  capillary  system  (Fisher 
and  Triplett,  1999).  The  amplicons  generated  using  ARISA  are  generally  longer 
(typically  over  a  thousand  bases)  so  the  resolution  of  length  may  not  be  as 
precise  as  in  LH-PCR  (depending  on  the  primers  used  in  LH-PCR),  although  a 
larger  variety  of  fragments  is  produced,  potentially  allowing  finer-scale  resolution. 
Again,  Genescan  peak  area  is  used  as  a  proxy  for  the  abundance  of  the 
fragment  source  clade(s).  ARISA  has  been  used  for  a  number  of  high 
phylogenetic-resolution  community  profiling  studies,  some  with  large  numbers  of 
samples,  draw  sophisticated  correlations  between  detailed  community  structure 
and  environmental  parameters  (e.g.  from  oceanic  drifter  samples,  Hewson  et  al., 
2006a,  and  at  several  Microbial  Observatories:  Fuhrman  et  al.,  2006,  Hewson  et 
al.,  2006b,  Kent  eta!.,  2007). 

ITS-LH-PCR  is  a  variant  of  ARISA,  achieving  yet  a  finer  level  of 
phylogenetic  resolution.  It  involves  a  second,  parallel  step  of  restricting  ITS 
amplicons  by  targeting  the  tRNA-alanine  genes  that  commonly  occur  within  ITS 
regions  (Suzuki  et  al,  2004),  and  then  estimating  the  size  of  the  restricted 
fragment.  This  form  of  ITS-LH-PCR  provides  a  higher  degree  of  phylogenetic 
specificity  among  those  clades  with  a  tRNA-alanine  in  their  ITS,  as  compared  to 
standard  ITS-LH-PCR/ARISA.  ITS-LH-PCR  has  been  used  as  a  library-screening 
method  (Suzuki  et  al,  2004)  to  assess  community  diversity  captured  in  a  large- 
insert  clone  library. 

There  are  caveats  involved  in  the  interpretation  of  ITS  patterns  observed 
with  ARISA.  First,  not  all  species  have  linked  16S  and  23S  rRNA  genes  (e.g.  in 
Planctomycetes,  Liesack  and  Stackebrandt,  1989,  Menke  et  al.,  1991).  Second, 
many  organisms  have  multiple  linked  16  and  23S  rRNA  genes  in  their  genomes, 
the  copies  of  which  may  or  may  not  be  identical.  A  recent  study  of  155  fully- 
sequenced  bacterial  genomes  showed  the  average  number  of  rRNA  (rrn) 
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operons  per  genome  to  be  4.8  (range  2-15),  with  2.4  unique  ITS  length  variants 
per  genome  and  2.8  unique  ITS  sequence  variants  (Stewart  and  Cavanaugh, 
2007).  Thus,  although  gene  conversion  does  act  to  homogenize  multiple  rrn 
operons  in  a  genome,  substantial  heterogeneity  persists  (Stewart  and 
Cavanaugh,  2007).  Third,  there  appears  to  be  preferential  PCR-amplification  of 
shorter  ITS  templates  (Fisher  &  Triplett,  1999)  causing  the  relative  abundance  of 
variants  to  be  skewed  (beyond  other  potential  PCR  biases;  see  “Amplification” 
section). 

ARISA  is  relatively  easy,  and  like  LH-PCR  its  profiles  are  digitally  extracted, 
and  run  alongside  markers,  and  so  can  be  compared  easily  between  runs  and 
labs.  However,  it  is  “destructive”  fingerprinting  and  amplicons  cannot  be 
sequenced  after  they  are  measured. 

Fingerprinting  -  Additional  considerations:  Lineage  differences  in  target 
gene  copy  number,  and  intragenomic  diversity  of  multicopy  genes,  can  have 
potentially  confounding  effects  on  fingerprint-based  single-gene  community 
profiling.  For  example,  with  a  range  of  2-15  rrn  operons  per  genome  among  155 
bacterial  genomes  (Stewart  and  Cavanaugh,  2007),  16S-  or  23S-based  diversity 
could  overestimate  organismal  diversity  quite  considerably.  In  addition,  for  many 
of  above  fingerprinting  methods,  it  can  be  difficult  to  compare  between  labs  and 
environments.  And  for  all  of  the  above  methods,  it  is  not  always  straightforward 
to  convert  fingerprint  data  into  ecologically-meaningful  metrics.  A  recent  unified 
set  of  metrics  has  been  proposed  (Marzorati  et  al.,  2008),  developed  for  DGGE 
but  also  applicable  to  other  fingerprint  data,  and  summarized  here  as  being  of 
particular  interest.  This  conversion  of  fingerprints  to  environmental  metrics  has 
three  steps:  1.  Generation  of  a  range-weighted  richness,  Rr,  describing  the 
relative  complexity  of  the  fingerprint  given  the  degree  of  separation  applied;  in 
the  case  of  DGGE  this  would  involved  the  number  of  bands  (A/)  and  the 
denaturing  gradient  the  span  of  given  bands  {Dg,  e.g.  35%  to  40%  urea  and 
formamide),  such  that  Rr  =  {hf  x  Dg).  2.  Community  dynamics,  Dy,  where  profiles 
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are  available  for  the  same  community  over  time  (which  they  usually  are,  since 
they  are  used  to  identify  shifts  in  community  composition).  This  uses  a  moving 
window  analysis  to  look  at  the  Pearson  correlation  between  subsequent  time 
points,  and  calculates  the  consecutive  percent  change  of  the  community  at 
consecutive  times.  This  is  then  converted  into  a  Dy  value  for  that  community 
averaging  the  moving  window  percent  change  values  over  time.  Low  Dy  values 
represent  communities  that  do  not  change  quickly  or  substantially,  and  high  Dy 
values  indicate  highly  dynamic  communities.  3.  Functional  organization,  Fo,  is  a 
proxy  for  evenness  of  the  fingerprints  observed.  It  is  plotted  as  the  number  of 
unique  fingerprint  elements  observed  along  the  x  axis,  and  their  contributions  (by 
e.g.  intensity)  along  the  y  axis,  such  that  a  1 :1  line  would  represent  perfect 
evenness  and  skewing  from  that  line  indicates  the  relative  unevenness  of  a 
community.  The  authors  take  this  final  Fo  metric  a  step  further  to  tie  evenness  to 
community  resiliency,  but  regardless  of  the  merits  of  that  connection,  the 
potential  utility  of  these  basic  quantification  metrics  stands.  A  given  environment 
can  then  be  plotted  as  a  point  in  three  dimensions,  and  compared  to  other 
environments. 

Fingerprinting  summary:  The  important  unifying  caveats  of  these  methods 
are  that  single  observed  fingerprints  may  not  be  phylogenetically  coherent,  and 
that  different  fingerprints  using  a  given  method  probably  do  not  reflect  the  same 
level  of  phylogenetic  resolution.  In  addition,  since  most  techniques  are  not 
universal  in  their  coverage,  they  may  miss  significant  subsets  of  the  community. 
Also,  many  of  them  have  limited  dynamic  range,  which  may  preclude  accurate 
relative  abundances  comparisons  of  different  groups  among  samples. 
Fingerprinting  methods  are  ubiquitously  used  in  microbial  ecology  research  as 
they  provide  quick,  relatively  inexpensive  profiling  information.  However,  care 
must  be  taken  in  their  interpretation  and  validation.  Additional  overall  conclusions 
about  fingerprinting  methods  are  that  visualization  by  fluorophores  leads  to 
higher  sensitivity  than  by  gel  staining,  and  methods  that  capture  data  in  silico 
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(e.g.  using  Genescan  mode  of  capillary  sequencers)  offer  higher  resolution  and 
reproducibility. 

Sequencing  of  Amplified  Genes 

The  highest-resolution  way  to  distinguish  sequence  variants  among  PCR 
amplicons  is  to  sequence  them.  Because  these  community  PCR  amplicons  are 
heterogeneous,  they  must  be  first  cloned  in  order  to  separate  each  variant  and 
then  individually  sequenced.  Although  it  has  been  speculated  that  small-insert 
cloning  introduces  bias,  (e.g.,  small  sequences  clone  with  higher  efficiency,  end 
sequences  can  effect  modification  reactions  required  in  cloning,  and  expressed 
inserts  may  be  toxic  to  the  host),  one  recent  study  showed  good  congruence 
between  chemically  dissimilar  cloning  methods  and  concluded  that  in  general, 
little  bias  is  introduced  using  the  common  TOPO  TA  cloning  method  (in  a  ITS 
and  rRNA-gene  survey,  Taylor  etai,  2007).  Sequencing  16S  rRNAgene  clone 
libraries  remains  the  gold  standard  for  profiling  communities,  and  there  is  a 
wealth  of  information  about  interpreting  microbial  diversity  with  this  method  (e.g. 
Schloss,  2008,  Fierer  et  al.,  2007,  Janssen,  2006,  Bohannan  and  Hughes,  2003, 
Hughes  et  al.,  2001),  and  so  it  will  not  be  discussed  in  depth  here. 

New  high-throughput  sequencing  technologies,  e.g.,  “454”  or 
“pyrosequencing”  (Margulies  et  al.,  2005),  allow  direct  sequencing  without  the 
need  to  clone,  separating  templates  instead  using  microfluidics,  dilution,  bead¬ 
binding,  and  isolation  within  tiny  reaction  wells  for  sequencing  reactions.  This 
technique  therefore  avoids  potential  amplification  and  cloning  biases,  and  also 
generates  up  to  400,000  sequencing  reads  per  run.  Current  limitations  are  that 
sequencing  chemistry  is  expensive,  and  read  lengths  are  significantly  shorter 
than  traditional  Sanger  sequencing  (100-250  bases  versus  750+  bases).  These 
shortened  read-lengths  create  a  trade-off  between  robust  phylogenetic 
assignment  (e.g.  see  Krause  et  al.,  2008)  and  phylogenetic  discrimination. 
However,  some  analyses  indicate  that  short  reads  can  still  allow  profiling  via  the 
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16S  rRNA  gene  with  some  confidence,  at  varying  levels  of  resolution,  depending 
on  the  assignment  method  used  (e.g.  Liu  et  al.,  2007,  Sundquist  et  al.,  2007), 
particularly  when  variable  regions  within  the  16S  gene  are  targeted  (e.g.  Sogin  et 
al.,  2006,  Huber  et  al.,  2007,  Roesch  et  al.,  2007).  In  addition,  technology 
improvements  are  producing  ever-lengthening  reads  (e.g.,  500  bases  by  the  end 
of  2008  with  454  technology,  and  an  estimated  20kb  by  Pacific  Biosciences  in 
2010,  Korlach  ef  al.,  2008). 

DMA  microarravs: 

In  general  terms,  DMA  microarrays  are  a  hybridization  platform,  with  DNA 
probes  immobilized  on  a  substrate  (often  a  glass  slide),  used  to  query  a  mix  of 
nucleic  acids  for  complementary  sequences.  The  query  mix  is  labeled  with  a 
fluorophore  such  that  its  hybridization  of  to  the  immobilized  probes  can  be 
visualized.  A  single  array  can  contain  tens  of  thousands  of  probes,  deposited 
using  robotic  spotters  or  synthesized  in  place  using  photolithography.  Since  their 
development,  microarrays  have  been  used  primarily  in  gene  expression  studies 
to  compare  expression  across  different  cellular  types  or  conditions.  However  as 
the  technology  has  matured,  microarrays  have  been  applied  to  a  widening  range 
of  biological  questions,  including  microbial  ecology.  Microarrays  are  currently 
applied  in  microbial  ecology  to  assay  the  presence  and  relative  abundance  of 
particular  organisms  or  genes  (for  reviews  see  Lucchini  et  al.,  2001 ,  Zhou,  2003, 
Gentry  et  al.,  2006,  Wagner,  2007). 

There  are  two  broad  categories  of  microbial  microarrays.  The  first, 
representing  the  majority  of  microarrays,  target  particular  genes.  There  are  two 
types  of  gene-specific  arrays.  In  the  first,  arrays  target  putative  functional  guilds 
(e.g.  sulfate  reducers,  nitrogen  fixers,  etc.),  either  via  functional  genes  in  the 
pathway(s)  of  interest  or  16S  genes  in  cases  where  16S  identity  correlates  to 
conserved  metabolism  (Small  et  al.,  2001 ,  Wu  et  al.,  2001 ,  Cho  and  Tiedje, 

2002,  Koizumi  et  al.,  2002,  Loy  et  al.,  2002,  Bodrossy  et  al.,  2003,  Taroncher- 
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Oldenburg  etai,  2003,  Greene  and  Voordouw,  2003,  Stralis-Pavese  eta!.,  2004, 
Tiquia  et  al.,  2004,  Rhee  et  al.,  2004,  He  et  al.,  2007).  The  second  type  of  gene- 
specific  arrays  are  those  which  attempt  to  holistically  profile  a  community,  via  its 
16S  diversity  (e.g.  Wilson  et  al.,  2002;  Marcelino  et  al.,  2006;  Palmer  et  al.,  2006; 
Brodie  et  al.,  2007;  DeSantis  et  al.,  2007).  For  both  types  of  gene-specific 
microarrays,  PCR  amplification  of  the  targeted  gene  from  the  sample  is  typically 
(though  not  always)  performed  prior  to  or  as  part  of  the  labeling  reaction,  such 
that  labeled  amplicons  are  hybridized  to  the  arrays. 

The  second,  less  common  category  of  arrays  used  in  microbial  ecology 
studies  are  community  genome  arrays  (CGA)  which  use  entire  genomes  as 
probes  (e.g.  Wu  et  al.,  2004,  Wu  et  al.,  2006,  Bae  et  al.,  2005),  and  evolved  from 
earlier  lower-throughput  platforms  whose  use  was  dubbed  “reverse  sample 
genome  probing”  (RSGP;  reviewed  in  Greene  and  Voordouw,  2003).  Thus  far, 
such  arrays  have  relied  on  axenic  cultures  or  isolates  in  order  to  generate  the 
required  genome  probes.  However,  it  has  been  suggested  (Zhou,  2003;  Greene 
and  Voordouw,  2003)  that  environmental  genomic  surveys  and  large-insert  clone 
libraries  could  instead  be  used  to  identify  and  generate  genome  fragment  probes. 

It  is  in  this  context  that  the  genome-proxy  array  has  been  developed,  and 
is  described  in  this  thesis.  The  genome  proxy  array  is  a  hybrid  of  the  two  major 
categories  of  arrays  currently  used  in  microbial  ecology,  in  that  it  targets 
genomes  and  genome  fragments  through  many  individual  gene-specific 
oligonucleotide  probes.  In  many  respects  it  is  conceptually  more  like  a  multi¬ 
species  “comparative  genome  hybridization”  (CGH)  array.  The  multiple 
oligonucleotide  probe  design  allows  a  finer-scale  resolution  than  using  entire 
genome  fragments,  and  allows  related  cross-hybridizing  sequences  to  be 
distinguished  based  on  their  hybridization  patterns.  In  the  genome  proxy  array, 
sets  of  70-mer  oligonucleotide  probes  (generally  n=20  per  genotype)  were 
designed  to  different  genomes  and  genome  fragments  derived  from  microbial 
assemblages  found  in  the  ocean,  one  of  the  most  comprehensively  characterized 


34 


and  genomic  sampled  environments  on  earth.  The  majority  of  these  targets 
(roughly  two-thirds)  were  sequenced  from  in-house  large-insert  environmental 
genomic  libraries,  captured  from  the  same  sites  under  investigation.  This  array 
platform  is  described  further  in  Chapter  2. 

Microarrays  have  certain  inherent  limitations,  as  a  technology  based  on 
hybridization.  Microarrays  can  generally  only  provide  information  about  what  is 
represented  on  the  array,  or  closely  related  sequences  (although  see  Wang  et  at. 
2002),  and  are  thus  fundamentally  different  from  metagenomics  in  their  ability  to 
profile  communities,  and  are  more  akin  to  methods  like  FISH.  In  addition,  arrays 
provide  relative  rather  than  absolute  quantification  (although  correlations 
between  array  signal  and  absolute  abundance  can  be  strong).  The  strength  of 
microarray  profiling  is  in  the  simultaneous  tracking  of  many  distinct  organisms  or 
genes,  unlike  FISH,  FCM,  or  Q-PCR,  and  at  a  higher  and  more  reliable  level  of 
resolution  than  fingerprinting  methods. 

Community  Genomics,  aka  Metagenomics 

There  are  two  distinct  methods  for  obtaining  metagenomic  data  from  a 
community.  Environmental  DMA  can  be  cloned  into  small-  (e.g.  shotgun)  or  large- 
insert  libraries  (e.g.  fosmid  and  BAC),  and  some  or  all  of  the  clones  can  be 
sequenced,  either  in  a  random  or  a  targeted  (i.e.  based  on  screening  of  the 
library  for  clones  of  interest)  approach.  Alternatively,  new  methods  have  allowed 
environmental  DMA  to  be  sequenced  directly  without  cloning. 

Environmental  genomic  libraries:  Small-insert  environmental  genomic  clone 
libraries(or  shotgun  clone  libraries)  have  been  used  in  a  number  of  different 
environments  to  capture  small  genomic  DMA  fragments  from  the  numerically 
dominant  members  of  natural  microbial  assemblages  (e.g.  Tyson  etal.,  2004, 
Venter  et  ai,  2004,  Tringe  etal,  2005,  Strous  etal.,  2006,  Rusch  et  al.,  2007, 
Yooseph  et  al.,  2007,  Kurokawa  et  al.,  2007).  The  relative  success  of  small-insert 
metagenomic  studies  is  directly  related  to  the  complexity  of  the  community,  the 
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amount  of  sequence  obtained,  and  the  goals  involved.  To  date,  these  libraries 
have  been  used  to  reconstruct  near  complete  “population  genomes”  from  low 
complexity  communities  (e.g.  Tyson  et  al.,  2004,  Strous  et  al.,  2006),  and  also 
used  for  gene-centric  approaches  in  more  complex  environments  (e.g.  Yutin  et 
al.,  2007).  In  both  cases,  these  data  are  used  to  look  at  the  distribution  and 
diversity  of  organisms  and  provide  insight  into  their  metabolic  and  functional 
capabilities. 

Large-insert  environmental  genomic  clone  libraries  typically  employ  a 
bacterial  artificial  chromosome  (BAG,  insert  size  -20-1 60kb)  or  a  fosmid  (insert 
size  ~35-40kb)  to  capture  large  genomic  fragments  from  a  cross-section  of 
individual  microorganisms  from  the  environment.  Such  large-insert  libraries  have 
been  constructed  from  a  number  of  habitats  and  extensively  screened  for  clones 
of  interest  (e.g.  Rondon  et  al.,  2000,  Beja  et  al.,  2002a,  Zeidner  et  al.,  2003,  de  la 
Torre  etal.,  2003,  Treusch  etal.,  2004,  Beja  et  al.,  2000,  Sabehi  etal.,  2005, 
Frigaard  etal.,  2006,  Neufeld  et  al.,  2008).  16S-  or  ITS-profiling  of  libraries  has 
also  been  proxy  for  profiling  of  the  communities  themselves  (e.g.  Suzuki  et  al. 
2004,  Martin-Cuadrado  etal.,  2007,  Neufeld  etal.,  2008).  End-sequencing  of 
large-insert  libraries  has  been  used  to  describe  the  taxonomic  and  metabolic 
profile  of  the  community  (e.g.  DeLong  et  al.,  2006,  Appendix  4,  Martin-Cuadrado 
et  al.,  2007),  while  full-sequencing  of  particular  clones  has  been  used  to 
investigate  genomic  context  for  particular  groups  or  processes  of  interest  (e.g. 
Beja  et  al.,  2002a,  Zeidner  et  al.,  2003,  Hallam  et  al.,  2006,  Frigaard  et  al.,  2006, 
Martinez  et  al.,  2007,  McCarren  et  al.,  2007,  Neufeld  et  al.,  2008). 

Metaqenomics  without  cloning:  Cloning  of  unamplified  total  DNA  (shotgun 
cloning)  and  subsequent  sequencing  is  being  eclipsed  by  highly-parallel,  clone- 
free  sequencing  technologies,  such  as  pyrosequencing  (described  above).  Such 
clone-free  sequencing  avoids  cloning  biases,  and  is  cheaper  per  base  pair 
obtained.  Currently  this  approach  suffers  only  from  short  read  lengths,  but  they 
are  quickly  increasing,  see  above.  There  have  been  a  number  of  pyrosequencing 
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studies  in  microbial  ecology  (e.g.  Mou  et  al.,  2008,  Dinsdale  et  al.,  2008b, 

Wegley  et  al.,  2007)  with  perhaps  the  most  advanced  to  date  including  a 
simultaneous  comparison  of  45  microbiomes  and  42  viromes  (Dinsdale  et  al., 
2008a),  and  an  ocean  microbial  metagenome  versus  meta-transcriptome  study 
(Frias-Lopez  &  Shi  et  al.,  2008).  Although  these  studies  offer  previously 
unobtainable  insights  into  microbial  communities,  made  possible  by  the  sheer 
depth  of  sequencing  involved,  care  must  be  taken  in  comparing  studies  using 
unamplified  DNA  with  those  amplifying  DNA  prior  to  sequencing.  Multiple 
displacement  amplification  (MDA),  using  the  phi29  polymerase,  has  been  used  in 
a  number  of  pyrosequencing  studies  (e.g.  in  some  but  not  all  of  those  reported  in 
Dinsdale  et  al.,  2008b),  and  its  biases  in  complex  mixed  community  samples 
have  not  yet  been  described. 

The  metagenomic  approach,  by  bypassing  the  preconceptions  (e.g.  about 
particular  known  sequence  and  metabolic  diversity)  inherent  in  the  design  and 
implementation  of  other  profiling  methods,  offers  microbial  communities  the 
clearest  opportunity  to  “tell  the  story”  of  what  is  important  in  their  world.  In 
addition  to  the  potential  biases  of  pre-sequencing  amplification,  however,  a  major 
caveat  of  both  cloning-based  and  cloning-free  metagenomics  is  that  the  bulk  of 
sequence  space  remains  unexplained  and  undefined,  with  the  majority  of 
metagenome  reads  representing  sequences  of  unknown  function. 

Community  Profiling  Conclusions 

A  wealth  of  profiling  tools  are  available  for  characterizing  microbial 
communities,  and  each  has  its  strengths  and  weaknesses.  As  the  price  of 
sequencing  continues  to  fall  it  will  replace  other  methods,  as  it  has  already  begun 
doing.  In  the  interim,  and  to  direct  sample  choice  for  community  sequencing 
efforts,  alternative  profiling  methods  will  continue  to  be  useful,  and  remain  widely 
applied  in  the  field.  Furthermore,  some  of  these  methods  have  uses  other  than 
just  community  profiling  (e.g.  FACS  and  FISH)  and  so  will  remain  important  tools 
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for  years  to  come. 

Presentation  of  the  Genome  Proxy  Array  in  this  Thesis 

This  thesis  describes  the  development,  testing,  and  application  of  a  new 
microarray-based  tool  for  community  profiling.  It  builds  upon  a  pre-existing 
knowledge  of  communities  of  interest  derived  from  the  sequencing  of  clones  from 
environmental  large-insert  clone  libraries,  and  of  cultured  isolates  from  related 
habitats.  Like  other  indirect  tools  for  microbial  community  profiling  (Rohwer, 
2007),  the  genome-proxy  microarray  is  expected  to  be  mostly  obsolete  within  five 
years  and  entirely  obsolete  in  10,  as  sequencing  costs  decrease  and  massive 
sequencing  is  feasible  for  high-resolution  spatial  and  temporal  studies  of 
microbial  communities,  in  research  labs  at  all  funding  tiers. 

Chapter  2  of  this  thesis  has  two  sections.  The  first  places  the  genome  proxy 
array  in  the  context  of  marine  microbial  research,  and  expands  upon  the 
applications  of  other  microarrays  to  microbial  ecology.  The  second  section 
describes  the  design  and  validation  of  a  prototype  of  the  genome  proxy  array,  in 
a  published  paper. 

Chapter  3  is  a  manuscript  in  preparation  describing  the  design, 
development,  and  validation  of  the  expanded  genome  proxy  array,  and  its 
application  to  a  time  series  in  Monterey  Bay. 

Chapter  4  summarizes  the  work  and  outlines  the  next  directions  for  the  use 
of  this  tool  during  its  remaining  lifetime  of  relevance. 

Appendices  include  the  methods  developed  for  this  array  platform,  a  primer 
on  array  design,  and  two  papers  I  have  been  involved  in  during  my  PhD  whose 
scope  pertains  to  the  topics  covered  in  Chapter  1. 
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Figure  1.  Community  profiling  methods  in  microbial  ecology.  A  community  sample  may 
be  treated  in  a  number  of  ways  during  profiling,  as  presented  in  this  review  (greyed 
sections  are  not  covered  in  depth).  Dashed  lines  indicate  less  common  links  between 
methods. 
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Chapter  2 


Development  and  Validation  of  a  Prototype  Genome  Proxy  Array. 


2a.  The  case  for  the  genome  proxy  array:  motivation  and  context. 


2b.  Rich,  V.I.,  K.  Konstantinidis,  and  E.F.  DeLong,  2008.  Design  and 
testing  of  ‘genome-proxy’  microarrays  to  profile  marine  microbial  communities. 
Environmental  Microbiology.  10(2):  506-521. 
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I.  The  Case  for  Microarravs  in  Marine  Microbial  Ecology 


As  details  of  the  marine  microbial  communities  have  been  revealed  at  ever- 
finer  scales,  the  complexity  and  variability  of  marine  microbial  diversity  continue 
to  astonish  (e.g.,  Acinas  &  Klepac-Ceraj  et  a!.,  2004,  Thompson  et  al.  2005, 
Rocap  et  al.,  2003).  In  tandem,  the  importance  of  these  communities  to  global 
biogeochemistry  has  become  increasingly  evident  (e.g.  Howard  et  al.,  2006,  Karl 
et  al.,  2007,  Moran  and  Miller,  2007,  Kuenen  et  al.,  2008).  This  deepening 
knowledge  of,  and  respect  for,  marine  microbial  communities,  particularly  as 
scientific  and  societal  concern  over  global  change  grows,  has  led  to  increasing 
research  efforts  devoted  to  further  understanding  their  composition,  dynamics, 
and  functional  capacities,  and  how  observed  genomic  variability  relates  to 
functional  variability  at  the  level  of  organism  and  system. 

Historically,  the  majority  of  marine  microbial  community  studies  have  focused, 
of  necessity,  on  specific  phylogenetic  or  functional  groups,  or  taken  a  coarser- 
grained  higher-taxonomic-level  approach.  Recently  new  methods  have  allowed 
the  scope  of  microbial  community  investigations  to  broaden  without  coarsening 
and  to  encompass  entire  co-occurring  microbial  communities  and  their  metabolic 
potential.  Despite  the  deluge  of  community  genomic  sequence  information 
brought  by  the  field  of  marine  microbial  metagenomics  (e.g.  Venter  et  al.,  2004, 
Delong  etal.,  2006  (Appendix  4),  Rusch  et  al.,  2007,  Yooseph  et  al.,  2007, 
Martin-Cuadrado  etal.,  2007,  Mou  et  al.,  2008,  Dinsdale  et  al.,  2008)  our 
understanding  of  marine  microbial  ecology  remains  far  from  complete. 

Thus  far  marine  metagenomic  studies  have  been  snapshots  of  communities, 
and  have  been  enormously  informative  but  do  not  reveal  community  dynamics. 
Microbial  ecology  is  at  a  cross-roads,  with  a  need  for  repeated  community-level 
sampling  to  better  understand  both  the  variability  of  the  tools  themselves 
(notably,  metagenomics  is  single  sequence  datasets  randomly  sampled  from 
large  pooled  sample),  and  the  actual  variability  of  communities  across  time  and 
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space,  under  both  natural  and  perturbed  conditions.  For  the  next  several  years, 
community  genome  sequencing  will  remain  prohibitively  costly  for  such 
widespread  use  for  ecological  studies.  These  investigations  require  inexpensive, 
high-throughput  methods  for  assessing  microbial  diversity  and  function. 
Microarrays  are  a  high-throughput  tool  that  permit  the  highly-parallelized 
simultaneous  query  of  many  targets. 


II.  Previous  and  Current  Microarrav  Applications  in  Microbial  Ecology 

Over  the  last  5-10  years,  microarrays  have  become  an  increasingly  common 
tool  in  microbial  ecology  and  span  increasingly-diverse  designs  and  goals 
(recently  reviewed  in  Gentry  et  al.,  2006,  Wagner  et  al.,  2007).  The  majority  of 
microbial  microarrays  designed  for  environmental  use  can  be  broadly  grouped 
into  three  categories:  functional  gene  arrays  (FGAs),  16S  arrays  (“Phylochips"), 
and  community  genome  arrays,  with  the  last  being  least  common. 


1.  Functional  Gene  Arrays.  FGA  probes  may  be  either  PCR  products  or 
oligonucleotides.  Sequence  information  may  be  pulled  directly  from  the 
environment  of  interest  by  PCR-amplification  of  the  relevant  functional  gene(s), 
and  then  using  the  amplicons  directly  as  probes,  or  cloning  and  sequencing  them 
and  designing  oligonucleotides  to  target  their  diversity  (e.g.  Taroncher-Oldenburg 
et  al.,  2003).  In  general,  FGA  studies  include  an  amplification  of  the  sample  DNA 
prior  to  hybridization,  using  primers  specific  to  the  functional  gene(s),  to  enrich 
for  the  sequences  being  targeted.  FGAs  have  been  developed  for  a  number  of 
important  microbially-mediated  biogeochemical  transformations,  such  as 
methanotrophy  (Bodrossy  et  al.,  2003,  Stralis-Pavese  et  al.,  2004,  Cebron  et  al., 
2007,  Gebert  et  al.,  2008)  and  marine  nitrogen  cycling  (Taroncher-Oldenburg  et 
al.,  2003,  Moisander  et  al.,  2007,  Ward  et  al.,  2007,  Zhang  et  al.,  2007). 
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The  most  ambitious  FGA  to  date  is  the  “GeoChip”  platform  designed  by 
Jizhong  Zhou  and  colleagues  (reviewed  in  He  etal.,  2007,  with  design  details  in 
Wu  et  al.,  2001,  Rhee  et  al.,  2004,  Tiquia  et  al.,  2004,  He  et  al.,  2005,  and 
Liebich  et  al.,  2006).  Rather  than  targeting  a  single  functional  gene  or  pathway, 
the  GeoChip  targets  most  major  nutrient  uptake  and  transformation  pathways 
that  have  been  identified  in  cultivated  organisms,  with  >24,000  probes  targeting 
>10,000  genes,  in  more  than  150  functional  groups  (see  table  1  at  end).  These 
targeted  groups  span  the  carbon,  nitrogen  and  sulfur  cycles,  as  well  as  metal  and 
organic  pollutant  transformations.  Variants  of  the  GeoChip  have  been  used  to 
investigate  microbial  responses  to  anthropogenic  soil  contamination  from  organic 
pollutants  and  metal  (Rhee  et  al.,  2004)  and  radioelements  (He  et  al.,  2007).  In 
addition,  the  GeoChip  has  been  used  to  assess  microbial  activity,  by 
hybridization  to  amplified  community  RNA  from  a  uranium-contaminated  soil 
(Gao  et  al.,  2007). 


2.  Phylochips.  Phylochips  typically  target  16S  rRNA  sequences,  and  typically 
have  hierarchically-nested  sets  of  probes  tailored  to  differing  levels  of 
phylogenetic  specificity.  Hybridization  usually  occurs  to  PCR-amplified  16S  rRNA 
genes  from  an  environment,  but  may  also  use  extracted  unamplified  16S  rRNA. 

There  are  three  major  types  of  phylochips,  based  on  their  target  breadth 
and  selection.  Phylochips  may  be  designed  to  particular  monophyletic  functional 
guilds,  to  particular  phylogenetic  clades,  or  to  a  broad  swath  of  phylogenetic 
diversity.  The  first  type  is  valuable  when  investigating  microbial  metabolisms  in 
environmental  community  settings,  and  complements  the  functional  genes 
approach  described  above.  Here,  all  phylotypes  known  to  be  responsible  for  a 
given  process  are  targeted  via  their  16S  sequences.  Applications  of  this  guild- 
specific  Phylochip  approach  include  sulfate-reducers  (Small  et  al.,  2001;  Koizumi 
et  al.,  2002;  Loy  et  al.,  2002)  and  hydrocarbon-degrading  microbes  (Koizumi  et 
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a/..  2002). 


The  second  class  of  Phylochips  target  monophyletic  clades,  regardless  of 
their  functional  homogeneity.  Examples  of  these  arrays  include  the  “RHC- 
PhyloChip",  targeted  to  Rhodocyclales  (Loy  et  at,  2005),  the  “ECC-Phylochip” 
targeted  to  Enterococcus  spp.  (Lehner  et  a!.,  2005),  a  marine  V'/d/'/o-specific 
array  (employing  both  16S  and  23S  probes,  Marcelino  et  al,  2006),  and  an  array 
for  rhizosphere  Alphaproteobacteria  (Sanguin  et  al.,  2006). 


The  third  class  of  Phylochips  target  many  phylogenetic  clades,  such  as  a 
broad  suite  of  pathogenic  bacteria  (via  gyrB  rather  than  165;  Kostic  et  al.,  2007), 
or  sediment  genotypes  (Peplies  et  al.,  2006,  el  Fantroussi  et  al.,  2003,  Eyers  et 
al,  2006,  Neufeld  et  al.,  2006).  Two  such  generalist-Phylochips  warrant  special 
mention  for  their  remarkable  breadth.  The  Brown  and  Reiman  Labs  have 
developed  a  16S-based  array  for  profiling  gut  microbiota,  the  current  iteration  of 
which  has  ~3,100  species-level  probes  and  ~6,000  group-level  probes  (Palmer  et 
al.,  2007;  the  prototype  version  targeted  229  species  and  130  higher  nodes. 
Palmer  et  al.,  2006).  This  array  has  been  used  with  great  success  in  a  ground¬ 
breaking  study  of  the  development  of  the  gut  microbiome  in  human  infants 
(Palmer  et  al.,  2007).  A  second  generalist  Phylochip  has  been  developed  for 
environmental  microbes  by  the  Andersen  Lab,  and  targets  -9,000  distinct  OTUs 
(see  Table  2  at  end)  (Brodie  et  al.,  2006,  DeSantis  et  al.,  2005)  and  has  been 
applied  to  the  study  of  microbiota  in  urban  aerosols  (Brodie  et  al.,  2007), 
subsurface  soils  and  waters  (DeSantis  et  al.,  2007),  as  well  as  cystic  lung  fibrosis 
patients  (V.  Klepac-Ceraj,  pers.  comm.).  A  similarly  large-scale  phylochip  has 
subsequently  been  published  and  used  for  oral  microbial  communities  (Huyghe 
et  al,  2008). 
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3.  Community  Genome  Arrays  (CGAs).  In  contrast  to  using  16S  probes  as  a 
hook  for  target  genotypes  of  interest,  another  approach  is  to  target  entire 
genomes  or  genome  fragments.  Such  arrays  have  been  developed  for  several 
habitats  by  using  genomes  of  cultivated  organisms  and  environmental  isolates. 
CGA  probes  are  entire  genomes,  rather  than  the  oligonucleotides  or  PCR 
products  typical  of  phylochips  and  FGAs.  Community  genome  arrays  have  been 
successfully  applied  to  explore  dynamics  of  cultivars  and  isolates  from  soils,  river 
and  marine  sediments  (Wu  et  al.  2004,  2006),  kimchi  fermentation  (149  lactic 
acid  bacterial  genomes  were  monitored  during  fermentation,  Bae  et  al.  2005). 

The  CGA  approach  was  developed  out  of  earlier  lower-throughput  community 
genome  hybridization  efforts,  which  had  been  used  to  explore  oil  fields,  salt 
marshes,  and  acid  mine  drainage  sediments  (reviewed  in  Greene  and  Voordouw, 
2003). 

Community  genome  arrays  generally  have  species-level  specificity,  and 
may  even  be  specific  to  the  targeted  strain,  depending  on  hybridization 
conditions.  However,  because  each  probe  represents  an  entire  genome,  the 
cross-hybridization  of  different  strains  with  similar  overall  identities  to  the  target 
strain  cannot  be  distinguished  from  one-another.  In  addition,  with  strain-level 
specificity  sometimes  difficult  to  achieve,  the  resolution  of  CGAs  may  be  lower 
than  the  ideal  for  ecology  studies,  since  recent  genome  research  shows  that 
closely  related  genotypes  can  have  significant  ecological  differences  (e.g., 
Thompson  et  al.  2005,  Coleman  et  al.  2006).  Most  importantly,  however,  CGA  as 
described  uses  cultivated  organisms  or  isolates  as  targets,  both  of  which  are 
unlikely  to  represent  groups  that  are  abundant  in  situ. 

To  overcome  this  last  limitation,  the  community  genome  array  approach 
could  work  equally-well  with  cloned  and  captured  genome  fragments  from  the 
environment.  This  alternate  design,  of  first  conducting  extensive  genomic 
surveys  of  the  environment  of  interest,  and  then  designing  an  array  using  that 
sequence  information,  was  described  in  the  literature  several  years  ago  years 
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(Zhou,  2003;  Greene  and  Voordouw,  2003),  but  has  not  yet  been  realized  for  the 
purposes  of  microbial  ecology. 

Environmental  genomic  libraries  have  previously  been  combined  with 
microarray  technology,  but  with  the  goal  of  screening  the  libraries  rather  than 
exploring  the  environment  (Sebat  et  ai,  2003,  Park  et  al,  2008).  The  library 
inserts  themselves  serve  as  probes  on  a  microarray,  which  is  then  used  to  query 
pure  cultures,  enriched  treatments,  and  natural  communities.  The  results  identify 
specific  clones  for  further  investigation,  such  as  end-  or  full-sequencing;  for 
example,  if  querying  enrichments,  the  results  can  provide  information  on  the 
potential  importance  of  a  given  clone  in  a  specific  process.  This  application  of 
microarrays  -  using  the  environment  to  look  back  and  define  a  library  -  is  in 
some  ways  the  inverse  of  the  community  genome  proxy  approach. 


4.  The  Virochip:  Lastly,  although  viral  ecology  has  not  been  the  focus  of  this 
overview,  a  virally-targeted  microarray  platform  has  been  developed  that  is 
conceptually  distinct  from  the  microbial  arrays  described  above,  and  represents  a 
highly  successful  application  of  microarray  technology  to  complex  natural 
communities.  Furthermore,  this  alternative  design  was  the  inspiration  for  the 
genome  proxy  array  described  in  this  thesis.  Wang  et  a/.'s  “virochips”  (2002), 
like  CGAs,  start  from  the  knowledge  of  entire  genomes'  sequences.  The  crucial 
difference  is  that  instead  of  immobilizing  the  entire  viral  genomes,  oligonucleotide 
probes  for  each  open  reading  frame  (ORF)  are  created.  Using  custom  software 
(ArrayOligoSelector,  Bozdech  et  al.,  2003)  sets  of  70-mer  probes  for  each  viral 
genome  were  targeted  to  sequences  of  high  conservation  among  the  viral 
genome  database,  on  the  theory  that  these  conserved  probes  would  be  the  most 
likely  to  pick  out  novel  undescribed  virotypes  as  well,  particularly  among  quickly- 
evolving  viruses. 

The  first  iteration  of  the  virochip  was  specific  to  viruses  involved  in  head 
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colds,  and  was  initially  tested  on  RNA  from  infected  tissue  culture  cells.  The 
virochip  was  able  to  clearly  identify  the  presence  of  targeted  viruses  from  the 
pure  viral  cultures  of  tissue  culture  cells.  In  addition,  related  viruses  could  be 
detected  through  their  conserved  regions,  showing  distinctive  hybridization 
patterns  to  some  subset  of  the  probes  of  their  closest  relatives  -  what  Wang  et 
at.  (2002)  called  a  “viral  barcode".  Next,  the  Virochip  was  used  to  examine 
natural  community  samples,  by  adding  a  random-amplified  PCR  step  to  their 
protocol  to  obtain  cDNA  pools  from  the  nasal  lavages  of  purposefully-infected 
and  healthy  control  patients.  The  virochips  were  successfully  able  to  distinguish 
which  of  the  test  viruses  were  present  in  the  infected  patients.  Finally,  sick 
patients  with  unknown  and  in  some  cases  multiple  viruses  related  to  those 
targeted  were  tested,  and  the  virochip  could  identify  both  the  targeted  and  the 
novel  viruses.  The  phylogenetic  affiliation  of  these  novel  viruses  could  be 
determined  based  on  their  patterns  of  hybridization,  their  “barcodes”,  and  was 
confirmed  by  RT-PCR  with  family-specific  primers.  Thus,  even  within  complex 
natural  samples,  the  virochip  could  distinguish  related  serotypes  of  a  particular 
virus,  and  also  place  completely  unknown  samples  in  their  phylogenetic  context. 
Although  nasal  lavages  are  considerably  less  complex  than  a  typical  soil  or 
aquatic  microbial  community,  this  research  shows  that  by  using  not  one  but  a 
whole  suite  of  probes  for  a  given  species,  with  varying  levels  of  specificity,  a 
maximal  amount  of  information  and  resolution  can  be  obtained. 


5.  Limitations  of  Array  Platforms.  There  are  several  important  caveats  in 
contemplating  the  use  of  microarrays  in  microbial  ecology,  related  to  their 
inherent  limitations  and  also  to  the  methods  associated  with  their  use. 

First,  arrays  provide  only  relative  rather  than  absolute  quantification; 
different  probe,  even  designed  to  the  same  hybridization  parameters  in  silica,  can 
behave  differently  (e.g.  Kreil  etai,  2006).  Thus  while  the  correlation  between 
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target  abundance  and  array  signal  is  often  very  high  for  a  given  probe,  it  may  be 
vary  considerably  between  probes.  This  effect  can  be  ameliorated  but  not 
eliminated  by  the  use  of  multiple  probes  per  target.  (This  variability  in  probe 
sensitivity  can  be  seen  in  the  behavior  of  the  genome  proxy  array.  For  the 
Prochlorococcus  MED4  probe  set,  the  signal  correlation  to  cell  concentration  had 
an  of  1.0,  as  seen  in  Figure  5a  of  Rich,  Konstantinidis  and  DeLong,  2008. 
However,  across  all  targeted  genotypes,  the  correlation  of  array  intensities  to 
pyrosequencing-inferred  target  abundance  ranged  from  an  R^  of  0.85  -  0.91 ,  as 
seen  in  Figure  2  of  Rich  et  a!.,  in  prep..  Chapter  3).  The  conclusion  is  that  arrays 
are  most  robust  for  assessing  the  relative  changes  in  each  target  between 
samples,  and  somewhat  less  accurate  at  quantifying  the  relative  abundances 
between  targets. 

Second,  microarrays  can  generally  only  provide  information  about  what  is 
represented  on  the  array.  Some  platforms  such  as  the  larger  Phylochips,  the 
Virochip  and  the  genome  proxy  array  allow  significant  and  meaningful  cross¬ 
hybridization  to  genotypes  not  explicitly  represented  by  the  array,  but  are  still 
limited  to  sequences  related  to  those  targeted.  This  “you  can  only  see  what  you 
look  for”  drawback  is  not  limited  to  arrays,  and  can  have  significant 
consequences  to  our  ecological  interpretations.  Any  method  that  brings  a  filter  to 
our  observation  potentially  excludes  important  data.  For  example,  FGAs  and 
guild-specific  Phylochips  rest  upon  the  completeness  of  our  understanding  of  the 
process  of  interest,  at  the  time  of  the  design  of  the  array.  This  completeness  may 
be  flawed  in  several  major  ways.  Probes  in  the  above  types  of  arrays  examine 
only  the  already-recognized  participants  in  the  process  of  interest,  and  their  close 
relatives.  Recent  discoveries  in  microbial  ecology  have  proven  that  our  picture  of 
even  things  as  basic  as  phototrophy  may  be  much  narrower  than  what  is 
common  -  let  alone  present  -  in  Nature  (e.g.,  Beja  et  at.,  2000b).  Not  only  may 
there  be  as-yet  undiscovered  genes  and  pathways  mediating  processes  we  are 
trying  to  map,  but  the  connection  between  16S  identity  and  coherence  of 
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organism  function  is  poorly  understood.  Organisms  with  highly  similar  16S 
identities  may  have  quite  dissimilar  overall  gene  contents  and  consequently 
occupy  distinct  niches  and/or  play  different  ecosystem  roles.  Finally,  this  potential 
myopia  is  exacerbated  when  using  specific-primer-directed  PCR  in  the  creation 
and  design  of  the  probes  or  preparation  of  the  target.  Not  only  does  PCR  have 
potential  reaction-based  errors  and  biases  (e.g.  Thompson  et  a!.,  2002),  but  also 
primers  may  not  amplify  as  comprehensively  as  they’re  assumed  to  (for  example, 
“universar-primer-based  surveys  missed  an  entire  high-level  taxonomic  group, 
the  kingdom  Nanoarchaeota;  Huber  e/a/.,  2002). 


III.  The  Genome  Proxy  Array. 


Over  the  last  eight  years,  a  number  of  large-insert  (fosmid  and  BAC 
vector)  genomic  libraries  from  marine  picoplankton  (Beja  e/a/.,  2000a)  collected 
from  a  variety  of  ecologically-relevant  depths  in  two  oceanic  habitats  (the  coastal 
waters  of  Monterey  Bay,  and  the  oligotrophic  open  ocean  at  Station  ALOHA  in 
the  North  Pacific  Subtropical  Gyre)  have  become  available.  These  libraries  have 
been  surveyed  to  characterize  their  phylogenetic  and  functional  gene  content 
(Beja  et  al.,  2002a:  Zeidner  e/a/.,  2003;  Suzuki  et  at.  2004;  DeLong  et  al.,  2006; 
Hallam  et  al.,  2006),  thousands  of  clones  have  been  end-sequenced  (DeLong  et 
al.,  2006,  Appendix  4),  and  many  clones  have  been  fully  sequenced  (Stein  et  al., 
1996,  B6ja  et  al.,  2000b,  Beja  et  al.,  2002a,  Beja  et  al.,  2002b,  de  la  Torre  et  al., 
2003,  Sabehi  et  al.,  2004,  Coleman  et  al.,  2006,  Frigaard  et  al.,  2006,  Grzymski 
et  al.,  2006,  Martinez  et  al.,  2007,  McCarren  et  al.,  2007).  This  wealth  of 
environmental  genomic  data  provides  an  unprecedented  window  into  marine 
microbes,  the  majority  of  which  remain  uncultivated,  and  their  communities. 

By  designing  a  microarray  with  existing  genomic  survey  data  from  the  target 
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ecosystem,  a  broader  sampling  of  the  ecosystem’s  microbial  diversity  can  occur. 
This  has  two  important  benefits:  (i)  the  diversity  being  assessed  is  intimately 
linked  to  potential  function,  since  a  large  piece  of  each  organism’s  genome  is 
known  and  many  genes  along  that  piece  are  being  targeted,  and  (ii)  because  a 
large  section  of  target  sequence  is  known,  multiple  probes  can  be  designed  to 
target  each  genome,  allowing  more  conclusive  identification  of  community 
members,  and  meaningful  cross-hybridization  to  their  relatives. 

From  the  sequence  data  in  available  environmental  genomic  libraries  from 
Monterey  Bay  and  Station  ALOHA,  I  designed  70-mer  oligonucleotide  probes 
and  created  a  prototype  marine  picoplankton  microarray.  The  prototype  “genome 
proxy”  array  targeted  thirteen  sequenced  environmental  large-insert  clones  and 
the  full-sequenced  marine  cyanobacterium  Prochlorococcus  MED4.  The 
development  and  testing  details  are  described  in  the  published  paper  Rich, 
Konstantinidis  and  DeLong,  2008,  which  comprises  the  second  section  of  this 
chapter. 
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Chapter  2b 


Design  and  testing  of  ‘genome-proxy’  microarrays  to  profile  marine  microbial 
communities. 
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Design  and  testing  of  ‘genome-proxy’  microarrays 
to  profile  marine  microbial  communities 


Virginia  I.  Rich,’  Konstantinos  Konstantinldis*^  and 
Edward  F.  DeLong’  ** 

’The  MIT-WHOI  Joint  Program  in  Biological 
Oceanography,  and  *  The  Department  of  Civil  and 
Environmental  Engineering,  MIT,  48-427,  15  Vassar  St,, 
Cambhdge,  MA  02139,  USA. 

Summary 

Mlcroarrays  are  useful  tools  for  detecting  and  quan¬ 
tifying  specific  functional  and  phylogenetic  genes  In 
natural  microbial  communities.  In  order  to  track 
uncultivated  microbial  genotypes  and  their  close 
relatives  In  an  environmental  context,  we  designed 
and  implemented  a  ‘genome-proxy’  microarray  that 
targets  microbial  genome  fragments  recovered 
directly  from  the  environment.  Fragments  consisted 
of  sequenced  clones  from  large-insert  genomic  librar¬ 
ies  from  microbial  communities  in  Monterey  Bay, 
the  Hawaii  Ocean  Time-series  station  ALOHA,  and 
Antarctic  coastal  waters.  In  a  prototype  array,  we 
designed  probe  sets  to  13  of  the  sequenced  genome 
fragments  and  to  genomic  regions  of  the  cultivated 
cyanobacterium  Prochlorococcus  MED4.  Each  probe 
set  consisted  of  multiple  70-mers,  each  targeting  an 
Individual  open  reading  frame,  and  distributed  along 
each  -40-160  kbp  contiguous  genomic  region.  The 
targeted  organisms  or  clones,  and  close  relatives, 
were  hybridized  to  the  array  both  as  pure  DNA  mix¬ 
tures  and  as  additions  of  cells  to  a  background  of 
coastal  seawater.  This  prototype  array  correctly  Iden¬ 
tified  the  presence  or  absence  of  the  target  organisms 
and  their  relatives  In  laboratory  mixes,  with  negligible 
cross-hybridization  to  organisms  having  ^  -75% 
genomic  identity.  In  addition,  the  array  correctly  iden¬ 
tified  target  cells  added  to  a  background  of  environ¬ 
mental  DNA,  with  a  limit  of  detection  of  -0.1%  of  the 
community,  corresponding  to  -10^  cells  ml  ’  In  these 
samples.  Signal  correlated  to  cell  concentration  with 
an  of  1.0  across  six  orders  of  magnitude,  in 
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addition,  the  array  could  track  a  related  strain  (at  86% 
genomic  Identity  to  that  targeted)  with  a  linearity  of 
0.9999  and  a  limit  of  detection  of  -1%  of  the 
community.  Closeiy  reiated  genotypes  were  distin¬ 
guishable  by  differing  hybridization  patterns  across 
each  probe  set.  This  array’s  multiple-probe,  ‘genome- 
proxy’  approach  and  consequent  ability  to  track  both 
target  genotypes  and  their  close  relatives  Is  Important 
for  the  array’s  environmental  application  given  the 
recent  discoveries  of  considerable  Intrapopulatlon 
diversity  within  marine  microbial  communities. 

Introduction 

Microarrays  are  currenlly  applied  in  microbial  ecology  lo 
assay  Ihe  presence  of  particular  organisms  or  genes  in 
Ihe  environmeni  (for  a  receni  review,  see  Gentry  etal, 
2006).  Bolh  funclional  gene  arrays  and  phylogenetic 
arrays  have  been  developed  for  several  systems  and 
guilds,  Including  sulfale  reducers  (Small  etal.,  2001; 
Koizumi  etal.,  2002;  Loy  etal.,  2002),  methanotrophs 
(Bodrossy  etal.,  2003,  Stralis-Pavese  e/a/.,  2004), 
hydrocarbon  degraders  (Koizumi  etal.,  2002)  and 
microbes  involved  in  Ihe  nilrogen  cycle  (Wu  etal.,  2001; 
Taroncher-Oldenburg  etal.,  2003,  Tiquia  etal..  2004). 
among  olhers  (Cho  and  Tiedje,  2002;  Greene  and  Voor- 
douw,  2003;  Rhee  etal.,  2004,  He  etal.,  2007). 

In  addilion,  several  larger  16S  microbial  microarrays 
have  been  developed  lo  widen  Ihe  phylogenelic  scope  of 
diversify  investigations  (eg.  Wilson  etal.,  2002;  Mar¬ 
cel  ino  etal.,  2006;  Palmer  et  al ,  2006;  Brodie  et  a/.,  2007; 
DeSantis  etal.,  2007),  employing  a  hierarchical  probe 
design  (i.e.  some  probes  specific  lo  class,  olhers  lo  order, 
etc.)  crilical  for  robust  interpretalion  This  approach  ideally 
builds  upon  a  thorough  survey  of  Ihe  major  rRNA  gene 
diversity  in  an  environment  prior  to  the  array  design,  as 
was  undertaken  for  Ihe  human  gut  microflora  prior  to  the 
construction  of  a  rRNA-largeted  gul -census  array  (Palmer 
et  a/.,  2006)  Such  phyloarrays  provide  a  high-Ihroughput 
approach  lo  determining  community  composition,  while 
funclional  gene  arrays  have  the  polenlial  lo  map  a  com¬ 
munity’s  funclional  capabilities.  Few  array  platforms  have 
yet  emerged  lo  bridge  Ihe  Iwo,  however;  the  links 
between  16S  identity  and  funclional  capacity  remain 
unclear,  as  Iwo  organisms  with  highly  similar  16S  rRNA 
sequences  may  have  dislinci  ecological  capabilities  and 
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genetic  content  (e.g.  Rocap  efa/.,  2003;  Konstantinidis 
and  Tiedje,  2005). 

Another  approach  to  designing  microarrays  would  be  to 
target  genome  fragments  that  represent  both  the  phylo¬ 
genetic  and  functional  breadth  of  a  communfty,  to  permit 
finer-scale  tracking  of  populations.  This  requires  genomic 
surveys  of  the  environment  of  interest  (as  suggested  by 
Greene  and  Voordouw,  2003;  Zhou,  2003)  to  provide  the 
native  genomic  sequence  information  for  probe  design. 
While  several  arrays  have  been  designed  using  some 
sequences  derived  from  the  community  of  interest  (for 
example  with  environmental  isolates’  ON  A,  Greene  and 
Voordouw,  2003;Wu  era/.,  2004;  or  1  kb  inserts  from  a 
groundwater  cosmid  library,  Sebat  era/.,  2003),  a  more 
comprehensive  use  of  uncultivated  genomic  data  for 
arrays  has  not  yet  been  reported. 

We  describe  here  the  design  and  implementation  of  a 
prototype  oligonucleotide  microarray  targeting  environ¬ 
mentally  occurring  bacterial  and  archaeal  genotypes, 
which  were  characterized  through  the  recovery  and  analy¬ 
ses  of  large  genomic  fragments  from  marine  plankton.  A 
number  of  large-insert  genomic  libraries  have  previously 
been  constructed  from  marine  picoplankton  collected 
from  different  depths  in  several  oceanic  habitats:  the 
coastal  waters  of  Monterey  Bay  (B6j^  e/a/.,  2000a),  the 
oligotrophic  open  ocean  at  the  Hawaii  Ocean  Time  series 
(DeLong  e/a/.,  2006),  and  Antarctic  coastal  waters  (B6j^ 
e/a/.,  2002a).  These  libraries  have  been  surveyed  to 
characterize  their  phylogenetic  and  functional  content 
(B6j^  e/a/.,  2002a;  Zeidner  e/a/.,  2003;  Suzuki  e/a/., 
2004;  DeLong  e/a/.,  2006;  Hallam  e/a/.,  2006),  tens  of 
thousands  of  clones  have  been  end-sequenced  (DeLong 
e/a/.,  2006),  and  hundreds  of  clones  have  been  fully 
sequenced  (Stein  e/a/.,  1996;  B6j^  e/a/.,  2000b; 
2002a, b;  de  la  Torre  e/a/.,  2003;  Sabehi  e/a/.,  2004; 
Coleman  e/a/.,  2006;  Frigaard  e/a/.,  2006;  Grzymski 
e/a/.,  2006;  McCarren  and  Delong,  2007;  Martinez  e/a/., 
2007). 

The  ’genome-proxy’  array  targeted  ecologically  relevant 
marine  microbes  through  sets  of  probes  designed  to 
these  genome  fragments,  which  served  as  ’proxies’ 
for  the  genomes  of  these  uncultivated,  unsequenced 
microbes.  The  array’s  specificity  and  sensitivity  were 
tested  against  laboratory  mixes,  and  to  cells  added  to 
natural  seawater  samples  at  a  variety  of  concentrations, 
under  various  hybridization  conditions. 


Results 
Array  design 

The  prototype  microarray  targeted  13  BAC  or  fosmid 
genome  fragments  (20-1 60  kb)  from  both  bacteria  and 
archaea  (Table  1),  recovered  from  a  variety  of  marine 
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Table  2.  Library  collection  information  and  references. 


Library 

name 

Collection  location 

Collection 
depth  (m) 

Date  of 
collection 

Vector 

typo 

Library 

reference 

EBOOO 

Monterey  Bay:  36.7 '  N.  122.4  W;  near  Station  M2 

3 

3/17-99 

BAC 

Beja  et  al.  (2002b) 

EB080 

Monterey  Bay.  36'’45.50N.  122"02.10W 

80 

7/23/99 

BAC 

Suzuki  etaf.  (20041 

FF100 

Monterey  Bay,  36M5.50N,  122"02.10W 

too 

2/21/02 

pE  FIFOS 

Suzuki  etaf  (2004) 

EB750 

Monterey  Bay.  36M1.13N.  122’02.37W 

750 

4/11/00 

BAC 

Suzuki  etal.  (2004) 

HOT 

Station  ALOHA.  22  75^N. 

0 

12/11/01 

plndigoBAC536 

de  la  Torre  (2003) 

ANT 

Coastal  waters  near  Palmer  Station. 

Anvers  Island.  Antarctica 

0 

8/96 

pFost 

Beja  et  al.  (2002a) 

ORE 

Oregon  coast  (44^02. 729N.  124 -SZ.SOgW) 

200 

8/29/92 

pFost 

Stem  etal.  (1996) 

habitats  (Table  2),  as  well  as  the  cyanobacterium  Prochlo- 
rococcus  MED4.  These  clones  were  originally  sequenced 
because  of  the  presence  of  taxonomic  marker  or  specific 
functional  genes.  This  array  consisted  of  sets  of  70  bp 
oligonucleotides  targeting  each  genome  or  genome  frag¬ 
ment  (Fig.  1),  dispersed  along  the  target  sequences  with 
no  more  than  one  probe  per  gene,  and  excluding  rRNA 
genes  as  targets.  The  probes  were  selected  solely  based 
on  theoretical  thermodynamic  properties  and  GC  content 
(-40%);  that  is.  probe  selection  did  not  focus  on  specific 
genes  or  regions,  but  simply  produced  the  ’optimal’ 
probes  for  each  genome  proxy  based  on  the  probes’ 
predicted  hybridization  properties.  rRNA  genes  were 
excluded,  because  this  probe  design  approach,  which 
avoids  sequence  alignments  and  considerations  of  RNA 
secondary  structure,  would  be  unlikely  to  result  in  useful 
rRNA  probes.  Furthermore,  rRNA  probes  of  traditional 
design  could  not  be  included  on  the  array  because  their 
appropriate  hybridization  conditions  would  be  very  differ¬ 
ent  from  those  of  this  array’s  probes. 

Array  specificity 

When  hybridized  to  mixtures  of  cloned  environmental 
genomic  DNA  targeted  by  the  array,  the  array  produced 


signal  from  the  correct  probe  sets,  with  no  appreciable 
cross-hybridization  to  other  probe  sets.  For  example,  a 
mix  of  DNA  from  clones  ORE_4B7.  EB750_10A10. 
EB080_L12H07  and  HOT_02C01  produced  above¬ 
background  signals  from  only  the  corresponding  probe 
sets  on  the  array  (data  not  shown).  When  equal  amounts 
of  each  clone  DNA  were  hybridized,  the  mean  signal  for 
each  genotype  was  not  equivalent,  reflecting  microarrays' 
relative  -  rather  than  absolute  -  quantification  abilities 
due  to  variability  in  probe  hybridization  signal  (e.g.  Kreil 
etal.,  2006).  The  use  of  multiple  probes  to  target  many 
genes  from  each  organism  helped  to  normalize  probe-to- 
probe  heterogeneity,  by  averaging  across  all  probes  in 
a  set  (as  described  below).  The  evenness  of  probe 
response  across  each  genotype’s  set  was  also  used 
to  evaluate  the  relatedness  of  hybridizing  DNA  (see 
below). 

To  more  precisely  define  the  array’s  phylogenetic  range 
and  specificity,  it  was  tested  against  DNA  from  Prochloro 
coccus  MED4  and  related  strains,  spanning  the  known 
range  of  Prochiorococcus  phylogenetic  diversity  (Fig.  2a 
and  Table  3).  The  majority  of  tests  used  four  strains: 
MED4,  the  strain  explicitly  targeted  by  the  array; 
MtT95l5,  the  only  cultivated  sister  strain  to  MED4  within 
the  high-light  clade  II  (clade  definitions  sensu  Rocap 
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Flg.1.  Overview  ot  array  design  and  use. 
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Fig.  2.  Specificity  tests  with  Prochlorococcus  strains. 

a.  The  Prochlorococcus  strains  tested  span  the  clade's  phylogenetic  diversity  (ITS-based  tree  modified  from  Rocap  eta/.,  2002). 

b.  The  mean  normalized  signal  across  the  Prochlorococcus  MED4  probe  set  was  highest  when  the  array  was  hybridized  to  pure  MED4  DNA. 
Hybridization  separately  to  other  Prochlorococcus  strains  produced  decreasing  signal  as  the  phylogenetic  distance  increased.  The 
hybndization  data  shown  here  and  in  (c)  and  (d)  were  from  1 .8  ng  of  each  strain's  DNA. 

c.  The  evenness  of  the  signal  across  the  probe  set  also  decreased  with  increasing  evolutionary  distance. 

d.  The  ‘genome  proxy’  test.*  the  location  of  the  three  80  kb  targeted  regions  of  the  Prochlorococcus  MED4  genome,  and  the  relative  genomic 
identity  between  strains  MED4  and  MITSOIG  (also  see  Table  3  for  these  three  regions'  relative  genomic  identity  to  other  strains),  and  a 
representative  example  of  the  signal  across  the  three  probe  sets  designed  to  each  region  when  hybhdized  to  MED4  and  related  strains 
MiTg312  and  MITg313.  Region  1  is  0.80  kbp  along  the  MED4  genome,  region  2  is  1.2g-t.38  Mbp  along,  and  region  3  is  1 .58-1.66  Mbp 
along.  The  asterisk  indicates  that  region  2  spans  Isl5  island  of  inter-strain  genomic  variability  identified  by  Coleman  and  colleagues  (2006). 


etal.,  2002);  MIT9312,  from  the  high-light  clade  1;  and 
strain  MIT9313,  from  the  low-light  clade  IV.  DNA  from 
low-light  strains  NATL2A,  MIT9211  and  SSI 20  was  also 
tested. 


As  expected,  the  set  of  probes  targeting  Prochlorococ¬ 
cus  MED4  all  produced  signal  when  hybridized  to  pure 
MED4  DNA  (Fig.  2b  and  c).  When  other  Prochlorococcus 
strains  were  each  hybridized  separately  to  the  array,  both 


Table  3.  Prochlorococcus  strain  relatedness. 


Strain 

16S  identity 
to  strain  MED4 

Overall  ANI*  to 
strain  MED4 

ANI  to  MED4 

Region  1 

Region  2** 

Region  3 

MED4 

100% 

100%  (1658)" 

100%  (80) 

100%  (80) 

100%  (78) 

MITOSIS 

gg.9% 

86%  (1433) 

86%  (78) 

85%  (46) 

87.5%  (78) 

MITg312 

gg.1% 

78.5%  (1422) 

78%  (79) 

77%  (37) 

79%  (78) 

MIT9313 

07.9% 

64.5%  (403) 

64.5%  (20) 

65%  (6) 

64%  (28) 

a.  AN  I  is  average  nucleotide  identity,  calculated  per  Konstantinidis  and  Tiedje  (2005) 

b.  Region  2  spans  the  ISL5  genomically  variable  island  described  in  Coleman  and  colleagues  (2006). 

c.  In  parentheses  are  the  number  of  non-overfapping  tOOO  bp  fragments  with  BLAST-based  Identity. 
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the  mean  signal  and  the  evenness  of  signal  across  the 
MED4  probes  decreased  as  the  phylogenetic  distance  to 
MED4  increased  (Fig.  2b  and  c).  Consistently  across 
many  hybridizations.  MED4  probes  showed  strongest 
signal  when  hybridized  to  strain  MED4.  moderate  signal 
to  strain  MIT9515  (86%  genomic  identity  to  MED4),  lower 
signal  to  strain  MIT9312  (78.5%  genomic  identity),  and  no 
significant  signal  to  strain  MIT9313  (64.5%  genomic 
identity).  More  distantly  related  strains  NATL2A,  MIT9211 
and  SS120  produced  no  appreciable  signal  (data  not 
shown).  Furthermore,  the  probe  sets  targeting  en¬ 
vironmental  clones  did  not  show  appreciable  cross¬ 
hybridization  signal  against  Prochlorococcus  (Fig.  2b). 
While  the  overall  Prochlorococcus  signal  intensity 
decreased  trom  MED4  to  MIT9515  and  MIT9312,  the 
distribution  of  signal  across  the  individual  probes  in 
the  MED4  set  also  became  less  even.  For  example,  in  the 
hybridization  shown  in  Fig.  2c,  the  coefficient  of  variation 
(CV)  among  the  probes  Increased  from  0.79  for  MED4  to 
2.40  for  MIT931 2,  across  the  positive  control  probes  in  the 
same  two  hybridizations,  the  CV  was  1.01  and  0.97 
respectively. 

To  test  the  effects  of  hybridization  stringency  on  the 
specificity  and  signal  of  the  MED4  probes,  Prochlorococ¬ 
cus  strains  were  hybridized  at  a  range  of  conditions  (data 
not  shown).  In  general,  lowering  stringency  by  decreasing 
the  temperature  (65‘'C,  GO  ’C,  55'^C  to  50'’C),  or  increasing 
the  salt  concentration  in  the  hybridization  buffer  (3x  SSC, 
0,2%  SDS  to  3.5x  SSC,  0.3%  SDS),  produced  a  decrease 
in  the  array's  dynamic  range  (i.e.  less  signal  difference 
between  low  and  high  concentrations  of  a  given  strain), 
and  poorer  dischmination  among  related  strains.  The  pro¬ 
tocol  giving  the  best  dynamic  range  and  discrimination 
among  strains  (65X  and  3x  SSC.  0.2%  SDS  buffer,  see 
Experimental  procedures)  was  used  for  the  data  reported 
here  unless  otherwise  noted. 

To  test  whether  the  specificity  results  for  Prochlorococ¬ 
cus  were  comparable  for  other  targeted  clades,  two 
genome  fragments  recovered  from  closely  related  phy- 
lotypes  within  the  SAR86  clade  of  the  gammaprote- 
obacteria  were  represented  on  the  array,  and  were 
tested  for  specificity.  The  clones  HOT  04E07  from 
subclade  SAR86-I  (clade  placement  per  Sabehi  etal., 

2004)  and  EB000_31A08  from  subclade  SAR86-II 
are  syntenic,  97.5%  Identical  at  their  16S  genes,  and 
share  72%  genomic  identity  [calculated  as  average 
nucleotide  identity  (ANI),  Konstantinidis  and  Tiedje, 

2005] ,  When  the  array  was  hybridized  separately  to 
DNA  from  either  of  these  two  clones,  the  signal  of  the 
probes  targeting  the  other  was  within  the  background 
signal  (data  not  shown),  as  expected  from  the  Prochlo¬ 
rococcus  results.  These  results  demonstrate  that  the 
specificity  of  the  arrays  can  distinguish  between  closely 
related  phylotypes  of  yet-uncultivated  microorganisms. 


Effect  of  designing  probe  sets  to  different  regions 
of  a  target  genome 

To  understand  the  equivalence  of  probe  sets  targeting 
different  regions  of  the  same  organism’s  genome,  we 
targeted  three  80  kb  ‘genome-proxy’  regions  of  the 
Prochlorococcus  MED4  genome.  One  of  the  regions  fell  in 
a  genomic  ‘island’  where  inter-strain  variability  Is  concen¬ 
trated  (‘ISL5’  in  Coleman  etal.,  2006).  Shared  gene 
content  among  strains  was  variable  between  the  three 
regions,  while  the  sequence  identity  (as  ANI)  of  shared 
genes  among  strains  was  very  similar  between  the 
regions  (Table  3). 

When  hybridized  to  DNA  trom  MED4  and  related 
strains,  the  cumulative  signal  across  the  three  regions’ 
probe  sets  was  not  identical  (Fig.  2d),  as  expected  given 
probe-to-probe  signal  variability  (e.g.  Kreil  etal.,  2006), 
and  given  the  three  regions’  ditferences  among  strains. 
For  example,  between  the  target  strain  MED4  and  the 
strain  MIT9515,  Region  II  shows  57.5%  shared  gene 
content  (46  of  80  genes)  and  85%  genomic  identity,  while 
region  III  shows  100%  shared  gene  content  (78  of  78 
genes)  and  87.5%  genomic  identity.  The  hybridization 
signal  for  each  region  was  calculated  as  the  mean  signal 
across  all  probes  designed  from  that  region.  Despite  the 
ditterences  among  genomic  regions,  the  probe  sets 
designed  to  all  three  regions  were  effective  at  identifying 
both  targeted  and  related  genotypes.  Each  region’s  probe 
set  produced  maximal  signal  to  MED4,  with  decreasing 
signal  to  the  other  strains  as  phylogenetic  distance 
increased  (Fig  2d).  Across  the  three  regions,  this  relative 
decrease  in  signal  trom  MED4  to  MIT9515  and  MIT9312 
was  correlated  more  to  relative  genomic  identity  than  to 
relative  shared  gene  content  (average  Pearson  correla¬ 
tion  of  0  90  versus  0.70). 

Array  response  to  target  cells  in  natural  seawater 

To  test  the  array  in  a  complex  environmental  context,  we 
collected  coastal  seawater  (lacking  detectable  Prochloro¬ 
coccus  cells  by  flow  cytometry)  and  added  Prochlorococ¬ 
cus  cells  from  strains  MED4,  MIT9515,  MIT9312  and 
MIT9313  over  a  range  of  concentrations  from  ~10’  to  10^ 
cells  ml"*  (Fig.  3).  The  seawater  was  then  filtered,  and  the 
DNA  was  extracted,  amplitied,  labelled  and  hybridized  to 
the  array.  The  results  in  this  background  of  environmental 
DNA  agreed  generally  with  earlier  specificity  results  using 
DNA  from  laboratory  cultures.  MED4  probes  showed 
strong  signal  when  hybridized  to  strain  MED4,  moderate 
signal  to  strain  MIT9515,  and  no  significant  signal  to 
MIT9313  (Fig.  4a  and  b).  In  this  environmental  back¬ 
ground.  the  relative  signal  from  strain  MIT9312  was 
markedly  lower  than  observed  In  single-strain  laboratory 
hybridizations  (Fig.  4b  versus  Fig.  2b).  Thus,  the 
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Fig.  3.  Design  of  the  Prochlorococcus 
addition  experiment  to  a  natural  community. 
Coastal  seawater  was  collected,  and  cells 
from  Prochlorococcus  strains  MED4. 

MIT9515.  MIT93t2  and  MIT9313  were  spiked 
in  at  a  range  of  concentrations  from  -10’  to 
10®  cells  m^^ 


operational  phylogenetic  breadth  of  the  array  for  tracking 
related  genotypes  was  near  86%  genomic  identity  to  the 
target. 

To  further  explore  the  relationship  between  array  signal 
and  genomic  identity  to  the  target,  in  the  absence  of  a 
cultivated  strain  bridging  the  relatedness  between  strains 
MED4  and  MIT9515,  we  examined  the  subset  of  the 
MED4  probes  with  the  highest  identity  to  strain  MIT9515. 
These  probes  had  an  average  identity  of  92.4%  to  strain 
MIT9515,  while  the  average  genomic  identity  between 
MED4  and  MIT9515  is  86%.  The  signal  across  these 
probes  from  the  MIT9515  hybridization  was  intermediate 
between  the  MED4  and  MIT9515  signals  (Fig.  S3a  and 
b).  There  was  a  clear  linear  increase  in  array  signal  with 
increasing  genotype  identity  to  target  {R^  of  0.9959,  from 
78.5%  to  100%  identity.  Fig.  4b  inset). 

Correlation  between  cell  numbers  and  signal 

As  the  cell  concentrations  of  the  targeted  strain  MED4 
increased  across  six  orders  of  magnitude  within  a 
complex  natural  community,  the  mean  signal  intensity 
across  the  MED4  probe  set  increased  linearly,  with  an 
of  1.0  (Fig.  5a).  At  the  lowest  cell  concentrations  the 
signal  diverged  from  this  linear  relationship,  so  that 
the  operational  limit  of  detection  was  ~10^  cells  ml"''  of  the 
target,  which  in  these  coastal  water  samples  represented 
-0.1%  of  the  community. 

To  test  whether  this  linearity  would  hold  for  tracking 
related,  non-target  genotypes,  we  examined  the  cumula¬ 
tive  MED4  probe  signal  with  varying  cell  concentrations  of 
strain  MIT9515  (86%  genomic  identity  to  the  targeted 
strain  MED4).  As  cell  concentrations  of  strain  MIT9515 
increased  in  the  background  of  environmental  cells,  the 
mean  normalized  intensity  across  the  MED4  probe  set 
increased  linearly,  with  an  R^  of  0.9999  across  six  orders 
of  magnitude.  For  this  non-target  strain,  the  limit  of  detec¬ 
tion  was  around  10^  cells  ml“\  representing  approxi¬ 
mately  1%  of  the  community  (Fig.  5a). 


There  was  no  appreciable  correlation  between  cell  con¬ 
centrations  of  the  more  distantly  related  Prochlorococcus 
strains  and  mean  normalized  signal  across  the  MED4 
probes,  across  this  concentration  range  (R^  of  0.04  for 
strain  MIT9312  and  0.31  for  strain  MIT9313;  data  not 
shown). 

Array  data  metrics 

The  array  data  could  be  examined  in  two  ways,  either 
probe-by-probe  across  each  probe  set,  or  as  overall 
organism  signals.  In  addition,  at  a  given  hybridization 
stringency,  different  treatments  of  the  data  might  result  in 
different  degrees  of  apparent  cross-hybridization  between 
strains,  different  in  silico  stringency.  In  order  to  determine 
which  method  of  converting  the  individual  probe  signals 
into  an  overall  organism  signal  gave  optimal  discrimina¬ 
tion  between,  or  optimal  cross-hybridization  among, 
strains,  and  gave  optimal  correlation  to  cell  concentration, 
the  data  were  analysed  using  different  combinations 
of  metrics.  We  focused  primarily  on  the  data  from  the 
Prochlorococcus  addition  experiment,  as  being  the 
most  informative  and  representative  of  environmental 
data  sets.  All  data  presented  above  were  obtained  using 
the  optimized  analysis,  described  below. 

The  analysis  pipeline  began  by  taking  either  the  mean 
or  median  of  replicate  spots  of  each  probe,  minus  the 
mean,  median  or  Tukey  Biweight  of  the  negative  control 
probe  set.  [The  Tukey  Biweight  is  the  used  in  Affymetrix’s 
MASS  analysis  methods  to  calculate  the  signal  across 
sets  of  11-20  oligonucleotide  probes  targeting  a  single 
open  reading  frame  (ORF)  (Affymetrix,  2002);  it  weights 
each  value  based  on  its  proximity  to  the  median,  thereby 
reducing  the  effect  of  outliers.]  To  remove  the  effects  of 
non-discriminatory  probes,  we  tested  what  minimum  per 
cent  (Y%)  of  the  probes  in  each  probe  set  should  be 
required  to  show  signal  (greater  than  lx  or  2x  mean 
background,  or  greater  than  lx  or  2x  mean  negative 
control)  for  that  probe  set  to  be  considered  ‘present*.  Y 
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Fig.  4.  Dedphering  signal  of  related  strains  in  a  complex  background. 

a.  The  mean  normalized  signal  of  all  probes  sets  on  the  array,  from  one  representative  hybridization  series  {data  not  comparable  between 
series  because  of  different  salt  concentrations  and  temperatures  used)  of  the  expenment  described  in  Fig.  3.  Note  that  several  other  probe 
sets  showed  high  signal  indicating  the  likely  presence  of  other  targeted  genotypes  within  the  natural  community.  Specifically,  targeted 
environmental  genome  fragments  containing  proteorhodopsin  and  bacteriochlorophyll  genes  showed  signal  appreciably  above  background, 
consistently  across  the  many  aliquots  of  natural  water  used. 

b.  Focusing  just  on  the  data  for  the  MED4  probe  set  (red  bar  in  panel  a).  The  MED4  probes  showed  no  significant  signal  when  hybridized  to 
strain  MIT9313,  very  low  signal  to  strain  MITg312.  and  moderate  signal  to  strain  MIT9515.  Mixes  of  cells  from  the  four  Prochforococcus 
strains  tested  behave  as  expected  from  an  edditive  effect  of  their  respective  signals. 

Inset.  As  the  hybridized  strain  s  nucleotide  identity  to  the  target  strain  increased,  the  array  signal  increased  linearly,  above  the  limit  of 
hybridization  at  78.5%  identity.  Data  shown  are  the  -10®  cells  ml"’  additions  to  natural  seawater.  The  black-rimmed  circle  represents  a  ’virtual’ 
strain  representing  -92%  genomic  identity  to  MED4,  using  the  M1T9515  hybridization  deta  from  the  subset  of  MED4  probes  with  the  highest 
identities  to  MIT95t5,  on  average  92.4%  nucleotide  identity  (also  see  Fig.  S3).  The  value  is  calculated  for  signal  versus  identity  for 
MlTg312.  MIT9515  (across  all  probes),  the  ’virtual’  strain  and  MED4. 


could  not  equal  100  because  some  probes  were  poor 
performers,  and  because  we  wished  to  retain  the  signal 
from  related,  non-target  genotypes.  Next,  a  single  inten¬ 
sity  value  for  each  probe  set  was  calculated  as  the  mean, 
median  and  Tukey  Biweight  of  the  probe  signal  across 
each  probe  set.  Finally,  each  value  was  normalized  for 
array-to-array  brightness  by  the  mean,  median  or  Tukey 
Biweight  of  the  positive  control  probe  set. 

At  the  optimized  hybridization  conditions,  the  combina¬ 
tion  of  metrics  that  gave  the  best  correlation  between  cell 
concentration  and  signal,  and  also  produced  cross- 
hybridization  signal  of  the  MED4  probes  to  the  related 


strain  MIT9515,  was  the  following:  median  among  repli¬ 
cates,  then  the  mean  signal  across  probes,  minus  the 
mean  of  the  negative  control  probes,  normalized  to  the 
mean  of  the  positive  control  probes,  with  at  least  Y  =  45% 
of  the  probes  required  to  produce  signal  greater  than  2x 
mean  negative  control. 

By  lumping  related  genotypes  together  as  a  single 
signal,  this  combination  of  metrics  had  only  -1 0-fold 
difference  limit  of  resolution  between  samples  (e.g.  in 
Fig.  5a.  and  10®  cells  ml”'  of  MIT9515  gave  approximately 
the  same  signal  as  would  -2.5  x  10^  cells  mT’  of  MED4), 
and  missed  underlying  changes  in  population  structure. 


MED4  trendline: 
y  =  7E-05X  4-  0.2801 
R'  =  1 


M179515  trendline: 
y  *  2E-05X  +  0.225 
R'  a  0.9999 
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MED4  trendline; 
y  -  7E-05x  0  0614 

R2  =  0.9999 


M 1795 15  trendline; 
y  -  3E-07X  +  0.1301 

R2  =  0.8538 
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Fig.  5.  Signal  versus  cell  concentration. 

a.  As  the  MED4  cell  concentration  increased  across  six  orders  of  magnitude,  the  mean  normalized  intensity  across  the  MED4  probe  set  (filled 
circle)  increased  linearly  with  an  of  1.0.  The  operational  limit  of  detection  is  around  10*  cells  mh'  of  the  target,  which  in  these  coastal  water 
samples  represents  approximately  0.1%  of  the  community  As  cell  concentrations  of  strain  MIT9515  increased  in  the  beckground  of 
environmental  cells,  the  mean  normalized  intensity  (filled  square)  across  the  MED4  probe  set  increased  linearly,  with  an  of  0.99  across  six 
orders  of  magnitude.  For  this  non-target  strain  with  99.9%  16S  rRNA  identity  end  86%  overall  genomic  identity  1o  the  terget,  the  limit  of 
detection  is  around  10^  cells  ml  ’,  representing  approximately  1%  of  the  community 

b.  If  the  normalized  Tukey  Biweight  signal  across  the  probe  set  is  used  instead  of  the  mean,  the  stringency  of  the  hybridization  is  increased 
dramatically  in  silico  such  that  there  is  much  greater  separation  of  the  MED4  and  MIT9515  signals. 
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Thus,  a  secondary  metric  was  also  used,  the  Tukey 
Biweight  across  probes  in  a  set,  in  which  non-target  signal 
was  dramatically  reduced  (Figs  5b  and  S1).  For  example, 
target  signal  to  10®  cells  ml  ^  of  MED4  decreased  only  4% 
between  the  mean  and  Tukey  Biweight  metrics,  while 
non-target  signal  decreased  to  10®  cells  ml"'  of  MIT9515 
decreased  98%  (Figs  5b  and  SI).  To  explore  the  effect  of 
using  the  Tukey  Biweight  as  the  genomic  identity  to  target 
increased,  we  also  examined  the  Virtual’  strain  composed 
of  the  subset  of  MED4  probes  with  the  highest  nucleotide 
identities  to  strain  MIT9515  (92.4%,  versus  86%  between 
the  strains  overall,  as  described  above).  The  Tukey 
Biweight  reduced  the  signal  from  these  probes,  e.g.  by 
40%  for  the  1 0®  cells  ml  '  data,  resulting  in  a  signal  inter¬ 
mediate  between  the  MED4  signal  and  the  whole  probe- 
set  MIT9515  signal  (Fig.  S3b). 

To  better  distinguish  target  from  non-target  (but  related) 
signal,  we  also  tested  metrics  for  measuring  the  signal 
evenness  across  each  probe  set.  (For  example,  10*® 
cells  ml  '  of  MIT9515  gave  a  different  pattern  of  hybrid¬ 
ization  across  the  MED4  probe  set  than  would  -  2.5  x  lO"* 
cells  ml  '  of  MED4,  despite  their  mean  signal  being  iden¬ 
tical;  Fig.  5a  versus  Figs  2c  and  S2c,e).  We  calculated  the 
Shannon,  Simpsons  and  modified  Simpsons  evenness 
metrics  borrowed  from  ecology,  and  the  CV.  The  CV  maxi¬ 
mized  the  separation  between  the  target  strain  MED4  and 
the  related,  detected  strain  MIT9515  (data  not  shown); 
however,  none  of  these  evenness  metrics  captured  the 
sequential  pattern  of  hybridization  across  probes.  These 
measures  could  distinguish  target  from  non-target  popu¬ 
lations,  but  could  not  track  shifts  between  two  related, 
non-target  populations. 

To  further  compare  hybridization  patterns  between 
samples,  we  tested  the  Pearson  correlations  between  the 
probe-by-probe  signals  of  each  probe  set.  The  hybridiza¬ 
tion  pattern  was  consistent  within  strains,  across  all  con¬ 
centrations  above  the  limit  of  detection.  The  Pearson 
correlation  of  the  probe-by-probe  signal  for  the  Prochlo- 
rococcus  probes  was  significantly  higher  between  any  two 
hybridizations  of  the  same  strain,  than  it  was  between 
strains.  For  strains  MED4,  MIT9515  and  MIT9312  (diver¬ 
gent  strain  MIT9313  was  omitted  because  its  cross- 
hybridization  was  near-background  signal),  at  lO'-IO® 
cells  ml"’,  Pearson  correlation  within  strains  was  on 
average  0.73  (SD  =  0.1 8),  versus  0.44  on  average  across 
any  two  strains  (SD  =  0.18).  These  correlations  were  sig¬ 
nificantly  different  at  P  -  0.000  by  a  Student’s  equal- 
variance  (satisfied  by  F-test-0.95)  two-tailed  f-test.  To 
test  the  effect  of  higher  genomic  identities,  the  Pearson 
correlations  were  also  calculated  using  the  higher-identity 
‘virtual  strain’  probe  set.  For  these  probes’  patterns  In  the 
hybridizations  to  MED4,  MIT9515  and  MIT9312,  the 
average  within-strain  correlation  was  0.74  (SD  =  0.18). 
while  the  be  tween-strain  correlation  dropped  to  0.49 


(SD  =  0.18),  and  these  were  significantly  different  at 
P=  0.000  (F-test  =  0.95).  Lastly  exact  replicates  of  the 
same  cell  concentration  produced  a  higher  Pearson 
correlation,  as  might  be  expected.  The  same  amount  of 
positive  control  DNA  was  added  (pre-amplification  and 
labelling)  to  all  Prochlorococcus  addition  experiments,  so 
was  represented  by  27  replicates  at  the  optimal  hybrid¬ 
ization  conditions  reported  here.  The  hybridization  pat¬ 
terns  from  the  positive  control  probe  set  had  an  average 
Pearson  of  0.90  (SD  =  0.13)  between  replicates.  Thus  the 
hybridization  pattern  across  a  probe  set  can  be  compared 
between  samples,  to  allow  the  discrimination  of  different 
populations  of  cells. 

Array  response  to  mixed  populations  of 
related  genotypes 

Mixtures  of  Prochlorococcus  strains  were  also  added  to 
seawater  samples  to  test  the  performance  of  the  array 
when  challenged  with  mixtures  of  the  target  and  its 
relatives,  in  a  community  background.  Specifically,  four 
mixtures  were  tested:  (i)  10®  cells  ml  '  of  all  four  strains 
(MED4.  MIT9515,  MIT9312,  MIT9313);  (ii)  10®  cells  ml" 
of  MED4  and  10^  cells  mM  of  the  other  three  strains;  (iii) 
10®  cells  mb’  of  MIT9515  and  10^  cells  mb’  of  the  other 
three  strains;  and  (iv)  10®  cells  ml"’  of  M1T9312  and  10’^ 
cells  ml"'  of  the  other  three  strains.  The  cumulative  signal 
from  each  mix  was  essentially  equivalent  to  the  additive 
signal  of  the  component  populations;  the  presence  of 
related  genotypes  did  not  interfere  with  the  hybridization 
of  strains  that  could  bind  the  MED4  probes  (Fig.  4a). 
Furthermore,  by  Pearson  correlation,  the  mixed  samples 
in  which  one  genotype  dominated  gave  patterns  that  were 
distinct  from  one  another.  The  pattern  produced  by  Mix  3 
(dominated  by  MED4  cells)  was  different  from  that  by  Mix 
2  (dominated  by  MIT9515  cells),  with  a  Pearson  of  only 
0.38  (Fig.  S2).  Lastly,  by  using  the  Tukey  Biweight  instead 
of  the  mean,  the  contribution  of  non-target  cells  was  mark¬ 
edly  reduced  (Fig.  SI). 

Array-based  observations  of  natural 
community  microbes 

In  addition  to  the  added  Prochlorococcus  cells,  the 
coastal  water  samples  from  Woods  Hole  showed  the 
consistent  presence  of  several  of  the  targeted  genotypes 
in  the  natural  microbial  community.  Probe  sets  designed 
to  proteorhodopsin-containing  environmental  clones 
EB000  _41B09,  EBOOO  .55611,  and  EBOOO_31A08,  and 
bacteriochlorophyll  operon-containing  clone  EBOOO_ 
60D04,  all  showed  above-background  signal  at  con¬ 
sistent  levels  across  almost  all  of  the  27  experiments 
hybridized  at  the  optimal  conditions  (Fig.  4a).  This  signal 
persisted  even  when  using  the  Tukey  Biwelght  metric 
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(Fig.  S1),  which  reduces  signal  from  related  non-target 
genotypes,  suggesting  that  the  genotypes  present  in  the 
natural  seawater  were  closely  related  to  the  targeted 
genotypes.  Furthermore,  the  signal  from  each  of  these 
four  probe  sets  showed  high  agreement  in  pattern 
among  hybridizations.  The  average  Pearson  among  all 
EB000_31A08  hybridizations  was  0.74  (SD  =  0.18),  for 
EB000_60D04  the  average  was  0.84  (SD  =  0.12), 
for  EB000_55B11  it  was  0.88  (SD  =  0.09),  and  for 
EB00O_41B09  it  was  0.81  (SD  =  0.17).  This  suggests  that 
the  genotypes  present  were  similar  among  samples 
tested. 

Discussion 

The  development  of  the  genome-proxy  microarray 
approach  was  motivated  by  the  need  for  high-throughput 
tools  to  track  marine  microbes  in  a  community  context. 
The  array  described  here  represents  a  complementary 
approach  to  existing  array  platforms  for  microbial  ecology. 
While  previous  microarrays  used  in  microbial  ecology 
have  primarily  targeted  single  functional  or  phylogenetic 
genes,  the  arrays  described  in  this  report  target  unculti¬ 
vated  microbes  through  ’genome  proxies’,  fragments  of 
native  genomes  captured  from  the  environment.  Each 
genome  Iragment  was  targeted  with  a  set  of  probes, 
selected  based  on  predicted  hybridization  characteristics 
and  GC  content.  Each  probe  targeting  a  given  genome 
could  then  ’vote’  on  the  presence  of  the  target  organism, 
averaging  across  the  variable  individual  probe  responses. 
By  targeting  sections  of  genomes  rather  than  single 
genes,  this  array  was  able  to  track  clusters  of  related 
genotypes  in  the  environment  at  a  relatively  high  level  of 
resolution,  while  simultaneously  tracking  a  subset  of  each 
genotype’s  individual  genes.  These  abilities  distinguish 
the  ’genome-proxy’  array  from  single-gene  arrays. 

We  characterized  the  phylogenetic  specificity  and 
experimental  sensitivity  of  this  array.  It  was  able  to  detect 
targeted  organisms  in  simple  laboratory  mixtures,  and  in 
complex  backgrounds  of  environmental  DNA.  Prochloro- 
coccus  was  used  as  the  primary  test  clade  (Fig.  2a) 
because  of  its  relevance  in  marine  plankton,  and  the 
availability  of  culturable  strains,  many  with  associated 
genomic  and  ecological  Information  (e.g.  Partensky  ef  a/., 
1 999;  Moore  etai,  2002;  Scanlan  and  West,  2002;  Rocap 
etal.,  2003;  Bouman  etai,  2006;  Coleman  efa/.,  2006; 
Johnson  et  a/.,  2006;  Zinser  et  a/.,  2006;  Garczarek  et  a/., 
2007).  Thus,  the  degree  of  cross-hybridization  to  the 
Prochforococcus  MED4  probes  by  DNA  from  other  strains 
could  be  placed  in  the  context  of  their  genetic  and  eco¬ 
logical  relatedness,  providing  a  model  for  the  array’s  phy¬ 
logenetic  specificity  within  an  environmental  context. 

Under  the  hybridization  conditions  and  analysis 
methods  used,  in  hybridizations  to  both  pure  DMAs  and  to 


cells  spiked  into  natural  community  samples,  the  array 
showed  negligible  cross-hybridization  to  distantly  related 
genotypes  [less  than  -78.5%  genomic  nucleotide  identity 
(ANI)  to  the  target]  (Figs  2b  and  4b).  Cross-hybridization 
signal  from  indiscriminate  probes  was  further  removed  by 
requiring  each  probe  set  to  show  signal  above  a  threshold 
value  in  a  certain  percentage  of  its  probes.  Closely  related 
genotypes  were  consistently  detected  by  the  array  (at 
86%  genomic  identity  to  target)  (Figs  2b  and  4b).  There 
was  a  strong  correlation  =  0.9959)  between  the  mean 
signal  and  the  identity  of  the  hybridized  genotype  to  the 
target,  above  78.5%  genomic  identity  (Fig.  4b  inset). 

This  ability  to  track  both  targets  and  their  relatives  rep¬ 
resented  both  a  benelit  and  a  challenge.  II  the  signal  from 
all  detected  relatives  of  a  given  target  were  lumped 
together  into  a  single  signal  for  that  target,  then  the  array’s 
limit  of  resolution  would  be  -10-fold  change  between 
samples,  with  cross-hybridization  to  relatives  indistin¬ 
guishable  from  up  to  10-lold  changes  In  target  abun¬ 
dance,  and  underlying  changes  in  population  structure 
would  be  missed.  However,  the  array’s  multiprobe  design 
allowed  for  more  nuanced  analysis.  Probe  sets  could 
‘vote’  through  either  a  permissive  metric,  the  mean  signal 
across  the  set,  or  a  non-perm issive  metric,  the  Tukey 
Biweight  signal  across  the  set  (e.g.  Fig.  5a  and  b).  Thus. 
Irom  a  single  hybridization,  the  data  could  be  interpreted 
broadly  or  stringently  in  sifico,  to  cast  the  net  narrowly  for 
the  targeted  organisms  or  more  broadly  to  Include  their 
close  relatives,  down  to  at  least  -86%  genomic  identity. 

The  evenness  and  pattern  of  the  signal  across  the 
probe  set  also  provided  important  information  and 
allowed  discrimination  of  the  target  genotype  from  that 
of  its  relatives,  and  close  relatives  from  one  another 
(Fig.  2c).  The  probe -by-probe  signal  patterns  of  a  given 
strain  in  different  hybridizations  were  significantly  more 
highly  correlated  to  one  another,  regardless  of  cell  con¬ 
centration,  than  to  the  patterns  of  different  strains.  In 
mixtures  ol  related  genotypes,  the  pattern  of  the  most 
abundant  genotype  dominated,  such  that  shifts  in  popu¬ 
lation  structure  between  mixes  were  evidenced  by  quite 
distinct  hybridization  patterns  (Fig.  S2).  Tbis  feature  of  the 
array  allowed  it  to  track  shifts  in  population  structure 
between  samples  spiked  with  cells  of  single  and  multiple 
strains. 

To  understand  the  ecology  of  organisms  over  lime,  it  is 
Ideal  to  track  not  only  their  presence  and  absence  but  also 
their  relative  abundance,  making  It  important  to  under¬ 
stand  how  the  microarray  signal  related  to  the  target 
organism’s  abundance.  This  array  showed  a  highly  linear 
relationship  between  cell  concentrations  and  signal,  even 
in  an  environmental  background.  This  linearity  held  for 
both  the  targeted  strain  (MED4;  R^  =  1.0)  and  its  relative 
(MIT9515;  FF  =  0.9999)  when  using  the  mean  normalized 
intensity  across  probes  (Fig.  5a).  The  limit  of  detection  for 
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the  targeted  strain  MED4  was  approximately  10^ 
cells  ml"’,  or  -0.1%  of  the  community,  and  10“*  cells  mf~’ 
for  strain  MIT9515,  - 1%  of  the  community.  This  limit  of 
detection  is  equivalent  to  or  below  that  reported  for  other 
recent  environmental  microbiology  microarrays  (e.g. 
Rhee  etal.,  2004;  Loy  etal.,  2005;  Gentry  etal.,  2006). 

Not  only  was  the  array  able  to  track  added  target  cells 
and  their  relatives  in  a  complex  background,  but  it  also 
Identified  the  likely  presence  of  other  targets  in  coastal 
waters,  in  a  different  oceanic  province  than  those  from 
which  the  target  clones  originated.  The  genome  proxies 
whose  probe  sets  produced  signal  in  the  Woods  Hole 
water  contain  proteorhodopsin  or  bacteriochlorophyll 
operons,  and  represent  putatively  phototrophic  organ¬ 
isms,  which  are  predicted  to  occur  in  such  a  habitat  (Beja 
etal.,  2002b).  The  consistency  of  their  array  signal  in 
samples  from  many  aliquots  of  adjacent  water,  by  both 
overall  mean  and  Tukey  Biweight  signal,  and  by  hybrid¬ 
ization  pattern  (Fig.  4a),  strongly  suggested  that  each 
target  was  present  in  the  community,  and  that  its  popula¬ 
tion  structure  did  not  vary  significantly  across  the  spatial 
scales  spanned  by  these  aliquots. 

The  design  approach  of  using  suites  of  probes  to  assay 
for  each  organism  was  a  crucial  feature  of  this  array.  The 
power  of  the  multiprobe-per-target  design  approach  has 
been  employed  by  other  array  platforms,  although  their 
different  goals  have  required  distinct  design  and  analysis 
strategies.  For  example,  Affymetrix  arrays  use  multiple 
short  oligonucleotide  probes  (in  some  cases  tiled  at 
regular  intervals)  to  assay  each  gene  or  gene  product,  but 
seek  very  high  specificity,  whereas  our  arrays  seek  to 
track  both  target  sequences  and  their  relatives,  within  a 
complex  environmental  background.  Our  goals  are  more 
comparable  to  that  of  the  ’Virochip’  microarray  used  for 
viral  identification  (Wang  etal.,  2002)  in  clinical  samples. 
However,  its  design  employed  viral  genome  alignments, 
with  hierarchical  probes  selected  to  conserved  and  vari¬ 
able  regions.  In  contrast,  the  approach  described  here 
made  use  of  the  higher  degree  of  sequence  conservation 
within  microbes  (compared  with  viruses).  By  selecting 
oligonucleotide  probes  based  primarily  on  their  hybridiza¬ 
tion  kinetics  and  without  requiring  alignments  to  related 
sequences,  we  sidestep  the  problem  of  limited  and  differ¬ 
entially  distributed  sequence  coverage  In  different  habi¬ 
tats  and  of  different  clades. 

The  use  of  this  genome  proxy  array  raises  the  ques¬ 
tion  as  to  whether  an  organism  can  be  targeted  based  on 
a  subset  of  its  genome.  Do  probe  sets  designed  to  dif¬ 
ferent  genomic  regions  give  substantially  different  results 
for  the  presence,  absence  or  relative  abundance  of  an 
organism?  The  environmental  clones  targeted  by  this 
array  represent  20-1 60  kb  sections  of  genome.  Popula¬ 
tion  genomic  variability  is  unevenly  dispersed  along 
genomes,  concentrated  in  hypervariable  regions  (e.g.  as 


in  Prochlorococcus,  Coleman  etal.,  2006),  such  that 
some  percentage  of  environmental  genomic  clones 
capture  hypervariable  genomic  regions.  If  such  regions 
were  targeted,  the  resulting  probe  sets  might  be  so 
genotype-specific  that  they  would  produce  little  cross¬ 
hybridization  to  close  relatives.  However,  we  do  not 
anticipate  this  being  a  significant  problem  in  the  use  and 
expansion  of  this  array,  for  several  reasons.  First,  the 
environmental  clones  that  are  sequenced  and  used  for 
probe  design  tend  to  be  16S-containing  clones,  and  16S 
operons  are  not  in  hypervariable  regions.  Second,  even 
when  somewhat  variable  regions  are  captured  and  tar¬ 
geted,  as  in  this  microarray’s  second  targeted  region 
of  the  Prochlorococcus  genome,  which  spanned  the 
ISL5  island  of  inter-strain  variability,  the  probe  sets  still 
cross-hybridize  to  related  genotypes  (Fig.  2d).  Signal 
intensity  was  correlated  more  strongly  with  the  identity  of 
shared  genes  than  with  the  overall  shared  gene  content 
in  a  region.  Thus,  except  in  extreme  cases  of  hypervari- 
able  island  capture  (which  would  likely  be  identifiable  by 
gene  content  anomalies,  i.e.  high  numbers  of  integrases. 
transposases  and  hypothetical  genes,  and  therefore 
avoided),  targeting  environmental  genomic  clones  as 
described  here  should  allow  the  subsequent  tracking  of 
their  relatives.  Furthermore,  this  approach  is  robust  to 
genomic  rearrangements  among  strains,  as  it  assays  the 
presence  or  absence  of  sections  of  DNA  rather  than  their 
relative  positions. 

Frequently  observed  in  the  oceans,  highly  similar  but 
non-identical  microbial  genotypes  tend  to  share  a  high 
degree  of  synteny  and  minimal  nucleotide  variation 
across  their  genomes  (Coleman  etal.,  2006;  Rusch  etal., 
2007).  For  example,  only  3-5%  variation  in  nucleotide 
identity  in  surface-ocean  Prochlorococcus  MIT931 2-like 
sequences  is  usually  observed  in  natural  populations 
(Coleman  etal.,  2006;  Rusch  etal.,  2007).  Thus,  the 
array’s  ability  to  track  related  genotypes  suggests  its  suit¬ 
ability  for  identifying  and  tracking  microbes  at  the  relevant 
levels  of  sequence  divergence  found  in  native  microbial 
populations.  The  empirical  results  with  Prochlorococcus 
genome  fragments,  along  with  the  SAR86  and  pufLM- 
containing  genotypes  we  detected  in  Woods  Hole  seawa 
ter,  also  support  this  conclusion. 

Furthermore,  this  degree  of  specificity  should  allow  the 
arrays  to  detect  previously  unrecognized  ecotypes  within 
uncultivated  target  lineages.  Overall  genomic  Identity  Is 
clearly  more  sensitive  than  16S  rRNA  identity  at  discern¬ 
ing  closely  related  populations,  with  organisms  highly 
similar  at  the  16S  level  sometimes  occupying  quite  dis¬ 
tinct  niche  space  (e.g.  Jaspers  and  Overmann.  2004; 
Hahn  and  PdckI,  2005;  Johnson  etal.,  2006).  The 
microarray  approach  described  here  has  the  potential  to 
track  shifts  In  populations  of  closely  related  genotypes 
under  changing  environmental  conditions. 
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In  cases  where  a  functional  gene  of  interest  is  present 
on  a  targeted  clone,  these  arrays  also  may  be  able  to 
match  the  distribution  and  expression  of  the  gene  to  that 
of  its  ‘owner’.  This  tool  has  the  potential  to  simultaneously 
assay  both  DNA  and  RNAfrom  environmental  samples,  to 
track  not  only  which  targeted  genotypes  are  present  but 
also  which  are  functionally  active.  This  will  improve  our 
understanding  of  microbial  activity  and  dormancy  in  dif¬ 
ferent  environmental  conditions.  It  may  also  indicate  when 
functionally  Important  genotypes  are  missing  from  our 
targets,  for  example  when  the  DNA  and  RNA  signal  for  a 
given  gene  is  high  but  that  of  its  genome  proxy  overall  is 
low.  There  is  even  the  possibility  of  using  the  array  ele¬ 
ments  as  capture  probe,  to  further  characterize  novel 
environmental  sequences,  as  hybridized  DNA  and  RNA 
can  be  recovered  directly  from  arrays  to  be  clones  and 
sequenced  (Wang  efa/.,  2003). 

We  expect  the  ’genome-proxy’  oligonucleotide  mlcroar- 
ray  to  be  a  useful  tool  for  conducting  high-throughput 
investigations  of  microbial  distributions,  community 
dynamics  and  functional  activity.  The  multiprobe  design 
strategy  results  in  hybridization  signal  that  is  dependent 
on  genomic  Identity  to  the  targeted  organism,  across  the 
region  targeted.  This  allows  not  only  the  tracking  of  clus¬ 
ters  of  related  genotypes  in  the  environment,  but  also  the 
distinction  among  related  genotypes,  by  using  the  pattern 
and  evenness  of  the  signal  across  each  probe  set  as  a 
’barcode’  of  each  different  genotype.  This  ability  gives  the 
array  the  potential  to  map  shifts  in  population  structure.  In 
addition,  this  allows  the  array’s  use  in  geographically  dis¬ 
parate  but  similar  habitats,  as  considerable  sequence 
divergence  is  tolerated.  No  sequence  alignments  are 
required,  obviating  the  need  for  coverage  of  the  phyloge¬ 
netic  space  surrounding  the  targeted  organism.  Also,  by 
using  genome  proxies  rather  than  single  genes  to  target 
organisms,  there  is  additional  20-1 60  kb  genomic  context 
available,  potentially  expandable  by  locating  contigs. 

With  these  prototype  arrays  now  validated,  we  are  con¬ 
structing  an  expanded  microarray  representing  hundreds 
of  genotypes  from  different  depths  in  open  and  coastal 
oceans.  These  will  be  used  to  track  microbial  community 
and  population  changes  in  time-series  datasets  (with 
accompanying  physical  and  chemical  data)  to  provide 
a  higher-resolution  understanding  of  the  dynamics  of 
marine  microbial  communities. 

Experimental  procedures 
Culturing  and  DNA  extractions 

Prochhrococcus  strains  MED4,  MIT9515,  MIT9312, 

NATL2A,  MIT9211  and  SS120  and  MIT9313  were  grown  in 
250  ml~^  I  cultures  of  Sargasso  seawater-based  Pro99 
medium  (Moore  etal.,  2002),  under  continuous  light  condi¬ 
tions  at  20°C.  High-light  strains  (per  Rocap  et  a/.,  2002)  were 


grown  at  35  \imo\  photon  m"^  S''  light  intensity,  while  low-light 
strains  were  grown  at  18-20  4mol  photon  m'^  s  \  DNA  was 
extracted  according  to  a  modified  phenol -chloroform  protocol 
(Steglich  etai,  2003)  and  treated  with  RNase  at  50  pg  ml"’ 
final  concentration  for  37°C  for  1  h,  then  re-extracted.  DNA 
from  the  DeLong  Lab  library  environmental  clones  used  in 
this  study  was  extracted  from  overnight  cultures  using  either 
Qiagen  miniprep  kits  (Qiagen,  Valencia,  California)  or  an 
AutoGenprep  960  (AutoGen,  Holliston,  Massachusetts)  auto¬ 
mated  extraction  robot,  followed  by  treatment  to  digest 
Escherichia  coli  DNA  with  ATP-de pendent  exonuclease 
(Epicentre,  Madison,  Wl)  according  to  the  manufacturer’s 
instructions.  DNA  concentrations  were  measured  using  an 
ND-1000  spectrophotometer  (Nanodrop  Technologies,  Wilm¬ 
ington,  Delaware).  The  positive  control  Halobacterium 
salinarum  NRC-1  DNA  was  purchased  (#700922,  ATTC, 
Virginia). 

Additions  ofProchlorococcus  to  natural 
seawater  samples 

Coastal  seawater  was  collected  from  the  Woods  Hole,  MA, 
town  pier  using  a  rinsed  bucket  and  transported  to  MIT  in  a 
50  I  carbuoy;  Prochhrococcus  was  undetected  in  this  water 
by  flow  cytometry  (per  Moore  etal.,  1998),  and  total  cell 
density  was  4.15x10®  by  Sybr-stained  flow  cytometric 
counts.  Prochhrococcus  strains  MED4,  M1T9515,  MIT9312 
and  MIT9313  were  separately  spiked  into  the  natural 
samples,  each  to  final  cell  concentrations  ranging  from  -10’ 
to  10®  cells  ml"'.  Culture  cell  concentrations  were  measured 
using  flow  cytometry,  and  necessary  dilutions  of  cultures 
were  made  with  0.2-pm-filtered  Sargasso  Sea  water.  Each 
aliquot  of  Woods  Hole  water  with  Its  spIked-in  Prochlorococ- 
cus  was  filtered  through  a  GF-A  prefilter,  then  collected 
on  a  Supor-200  (#60300,  Pall  Corporation,  Ann  Arbor.  Ml) 
0.2  pm  filter,  using  a  MasterFlex  peristaltic  pump  system 
(Cole-Parmer  Instrument  Company,  Vernon  Hills,  IL).  Filtered 
volumes  ranged  from  250  ml  to  1  I.  Filters  were  immediately 
frozen.  All  additions  and  filtratlons  were  made  within  24  h  of 
water  collection. 

Extractions  were  a  modification  of  a  filter  extraction  proto¬ 
col  described  previously  (Suzuki  etai,  2001).  Filters  were 
transferred  to  2.0  ml  screw-top  microcentrifuge  tubes,  and 
242  pi  of  lysis  buffer  was  added  to  each  [lysis  buffer:  40  mM 
EDTA,  50  mM  Tris  pH  8.3,  0.73  M  sucrose,  1.15  mg  ml"' 
lysozyme  (Sigma,  #L-6876),  200  pg  mL'  RNase  (Qiagen, 
Valencia,  CA,  #1018048),  0.2  pm-fifter-sterilized].  Samples 
were  incubated  at  37''C  for  30  min,  rotating.  In  total,  13.5  pi  of 
a  Proteinase  K  solution  [10  mg  ml"'  (EMD,  #24568-2)  In 
40  mM  EDTA,  50  mM  Tris  pH  8.3,  0.73  M  sucrose]  was 
added,  and  SDS  was  added  to  a  final  concentration  of  1%. 
Each  sample  was  Incubated  at  55X,  rotating,  overnight.  The 
samples  were  then  extracted  with  the  DNeasy  96  Tissue  kit 
(Qiagen,  Valencia,  CA),  by  a  modification  of  the  manufactur¬ 
er’s  protocol.  Each  tube  received  300  pi  of  Buffer  AL  (buffer 
AL/E  without  ethanol  added),  was  vortexed,  and  incubated 
for  70‘'C  for  1 0  min.  Then  300  pi  of  99%  ethanol  was  added 
to  each,  they  were  vortexed,  and  pipetted  onto  the  96-well 
spin  plate.  The  plate  was  sealed  with  the  Airpore  sheets 
(supplied  with  kit)  and  spun.  All  spins  were  carried  out  at 
40°C,  4612  g  in  a  Sorvall  Legend  RT  centrifuge  (Kendro 
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Laboratory  Products,  Newtown,  CT).  The  plate  was  spun 
10  min,  500  pi  Buffer  AWI  was  added  to  each  well,  and  the 
plate  re-sealed  and  spun  5  min.  In  total,  500  pi  Buffer  AW2 
was  then  added  to  each  well,  and  the  plate  re-sealed  and 
spun  5  min.  To  dry  the  plate,  the  column  portion  was  then 
transferred  to  a  new  rack  of  elution  microtubes  RS  (supplied 
with  kit)  and  incubated  for  15  min  at  70®C.  To  elute,  200  pi 
Buffer  AE  preheated  to  70 ’C  was  then  added  to  each  well, 
the  plate  was  re-sealed,  incubated  1  min  and  spun  for  2  min. 
The  elution  was  repeated  with  an  additional  200  pi.  The 
eluted  DNA  was  then  concentrated  using  Excela-Pure 
96-well  PCR  purification  kits  (Edge  BioSystems,  Gaithers¬ 
burg.  MD).  following  the  manufacturer’s  protocol.  Each  well 
was  rinsed  once  with  100  pi  nuclease-free  water  (#9937, 
Ambion,  Austin,  TX),  then  resuspended  in  20  pi  dilute  TE 
(1  mM  Tns  pH  8,  0.1  mM  EDTA  pH  8).  transferred  to  a  clean 
96-well  plate,  and  stored  at  -20®C.  Concentrations  were 
measured  by  Nanodrop. 

Microarray  probe  design 

Microarray  70-mer  probes  were  designed  using  the  program 
ArrayOligoSelector  (Zhu  efa/.,  2003)  with  the  following  set¬ 
tings:  target  %GC  =  40%.  1  probe/gene,  with  the  ORFs  for 
each  genome  fragment  as  both  the  input  and  the  database 
file.  The  output  candidate  70-mers  were  then  sorted  based  on 
their  %GC  and  those  closest  fo  40%  were  chosen.  In  the 
case  of  more  than  the  target  number  of  probes  having  40% 
GC  the  subset  with  the  lowest  tree  energy  of  hybridization 
were  selected  as  probes.  Generally.  20  probes  were  selected 
per  organism.  Prochiorococcus  MED4  was  represented  by 
60  probes  total,  20  each  for  three  different  80  kb  ‘genome- 
proxy’  regions:  0-80  kb,  1.29-1.37  Mbp,  and  1.58-1.66  Mbp. 

Using  the  same  method,  a  set  (n  =  20)  of  positive  control 
probes  were  designed  to  the  genome  of  the  halophillic 
archaeon  H.  salinarum  NRC-1 .  Negative  control  probes 
(n=28)  were  designed  to  a  set  of  49  random  1000-base 
sequences  (Stothard,  2000).  All  probes  sequences  and 
specifications  are  available  online  in  the  Gene  Expression 
Omnibus  (GEO). 

Probe  and  target  comparisons  in  silico 

Targets  were  compared  in  silico  In  several  ways.  Target  relat¬ 
edness  was  measured  for  closely  related  organisms  (for 
example  within  the  SAR86  or  Prochiorococcus  clades)  by 
both  1 6S  rRNA  gene  identity  and  ANI.  16S  gene  identity  was 
calculated  using  the  Distance  Matrix  (DNADist  format,  Jukes 
Cantor  corrected)  feature  of  the  Ribosomal  Database  Project 
(Cole  etal.,  2007).  Genomic  Identity  of  related  genomes  and 
genome  fragments  was  calculated  as  ANI,  as  described  by 
Konstantinidis  and  Tiedje  (2(X)5). 

Microarray  construction  and  hybridization 

Oligonucleotides  were  synthesized  (lllumina,  San  Diego, 
CA).  suspended  in  3x  SSC  to  a  concentration  of  40  pmol 
and  spotted  on  homemade  poly-L-lysine-coated  glass  slides 
using  a  QArray  2  microarraying  robot  (Genetix,  Hampshire, 
UK).  Six  replicates  of  each  probe  were  spotted. 


For  the  experiments  shown,  target  DNA  was  amplified 
and  labelled  using  A/B/C  random  amplification  (Wang 
etal.,  2003),  with  the  modification  that  the  initial  reverse 
transcription  step  was  omitted.  Briefly,  random-primed 
amplification  was  carried  out  in  three  reactions:  round  A 
used  Sequenase  to  extend  primer  A  (GTT  TCC  CAG  TCA 
CGA  TCN  NNN  NNN  NN);  round  B  used  20  rounds  of  PCR 
to  amplify  the  resulting  fragments,  using  primer  B  (GTT 
TCC  CAG  TCA  CGA  TC);  and  round  C  used  10  rounds  of 
PCR  to  incorporate  amino-allyl-deoxyuridine  tnphosphates 
(aa-dUTP).  For  the  environmental  samples,  the  amount  of 
DNA  into  each  reaction  was  normalized  to  represent  70  ml 
of  filtered  seawater.  All  A/B/C  reactions  were  performed  in 
triplicate  and  pooled.  Amplification  products  were  cfeaned 
using  a  Microcon  YM-30  and  concentrated  to  9  pi  in 
nuclease-free  water,  and  labelled  with  Cy3  by  combining 
8  pi  aa-DNA.  2  pi  0.5  M  NaHC03  and  5  pi  Cy3  dye  (33  pg 
in  DM  SO),  and  incubating  at  room  temperature  in  the  dark 
for  1  h.  Samples  were  cleaned  in  a  Microcon  YM-30.  con¬ 
centrated  to  19  pi  in  TE,  and  17.33  pi  was  added  to  hybrid¬ 
ization  buffer  for  final  concentrations  of  3x  SSC.  0.2%  SDS, 
0.4  mg  ml"’  poly  A.  0.02  M  Hepes,  pH  7,  in  a  final  volume  of 
25  pi.  Samples  were  denatured  4  min  at  100X,  then  pipet¬ 
ted  onto  the  arrays.  Arrays  were  hybridized  overnight  in  a 
heating  oven  (Model,  2000  Micro  Hybridization  Incubator. 
Robbins  Scientific,  Sunnyvale,  CA),  then  washed,  first  vig¬ 
orously  for  30  s  in  0.6x  SSC,  0.03%  SDS,  and  second  in 
0.06x  SSC  vigorously  for  30  s  then  gently  for  5  min.  For 
the  data  shown  in  this  paper,  hybridizations  were  carried 
out  at  65®C  and  washes  were  performed  at  room 
temperature. 

Microarray  data  analysis 

Hybridized  arrays  were  scanned  using  an  Axon  Instruments 
4000B  scanner  (Foster  City,  CA),  and  the  data  were  nor¬ 
malized  and  filtered  using  perl  scripts  written  for  the 
purpose,  by  the  following  steps,  (i)  Signal  intensities  for 
each  spot  were  calculated  by  subtracting  the  local  back¬ 
ground  (mean  F532  -  median  B532,  as  calculated  by 
GenePix  Pro  5.1  software.  Axon  Instruments),  (ii)  The 
median  value  across  replicates  was  calculated  for  each 
probe,  (iii)  For  each  probe  set,  the  number  of  probes 
greater  than  twice  the  mean  negative  control  signal  was 
calculated,  before  further  processing  (iv)  Filter  I:  Arrays 
with  less  than  half  their  positive  control  probes  exceeding 
twice  the  mean  negative  control  signal  were  considered 
poor  quality,  low  dynamic  range,  arrays  and  were  excluded 
from  further  analysis,  (v)  Each  probe  signal  was  corrected 
for  non-specific  binding  by  subtracting  the  mean  negative 
control  spot  signal.  (VI)  The  data  were  then  normalized  for 
array-to-array  variations  in  brightness  by  dividing  each 
probe  signal  by  the  mean  positive  control  signal.  This  posi¬ 
tive  control  signal  was  the  mean  signal  across  the  H.  sail- 
narum  probes  in  each  hybridization,  with  identical  amounts 
of  H.  salinarum  DNA  having  been  added  to  each  reaction 
prior  to  amplification  and  labelling.  (VII)  Filter  fl:  In  order  for 
a  genotype  to  be  considered  ‘present’,  at  least  45%  of  its 
probes  had  to  exceed  twice  the  mean  negative  control 
signal.  (VIII)  Finally,  each  genotype  signal  was  calculated  as 
either  the  mean  or  Tukey  Biweight  across  its  probe  set. 
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The  Tukey  Biweight  was  calculated  as  follows  (Affymetrix, 
2002).  For  n  probes  in  a  given  probe  set,  the  individual  probe 
values  are  x,,  x^, .  . . ,  x^,  after  earlier  pre-processing  steps,  m 
is  the  median  of  these  values  for  a  given  probe  set. 
MAD  =  weighted  median  of  these  values  =  median  (Ixr  -  ml, 
1x2  -  ml, ....  IXn  -  ml).  For  each  probe,  its  distance  from  the 
centre  is  calculated  as  t,=  (x,-  m)/(5*MAD  +  0.0001 );  where 
/  =  1 , 2, . . . ,  n.  Weights  for  each  probe  value  then  are  calcu¬ 
lated  by  the  bisquare  function,  B(0  =  {1  -  ff  for  Ifl  <  1,  or 
B{()  =  0  for  I/I  =  1 .  Then  the  Tukey  Biweight  (TBW)  can  be  for 
the  probe  set  as  a  whole  across  n  probes  with  values  Xr, 

X2,  . .  .  ,  x„:  TBW{xu  X, . x„)  =  (Z„ ,  B{t)  {x, )/[!,.  r  B(t)l. 

In  each  experiment,  four  metrics  of  evenness  were  calcu¬ 
lated  for  each  probe  set.  These  were  the  Shannon’s  index  of 
evenness,  the  Simpson’s  index  of  evenness,  the  Simpson’s 
modified  index  of  evenness  (all  Magurran,  1988)  and  the  CV. 
They  were  calculated  as  follows: 

Shannon’s  index  of  evenness 

EshMnncn=[-'ZP>  ln(p,  )]/ln(n) 

As  above,  n  =  the  number  of  probes  in  the  probe  set.  pi  =  x/X, 
where  X  is  the  summed  signal  across  all  probes  In  that  set. 
Simpson’s  index  of  evenness: 

faw»'  =  [l/XPn/^ 

Simpson’s  modified  index  of  evenness: 

=  [1/Xx,  (X,  -  1)/X(X  -  1)]/n 

Coefficient  of  variation 

CV=  [Vn*5^(x,-a)^]°ya 

Where  a  is  the  mean  value  of  x  across  each  probe  set. 

Finally,  for  the  Prochlorococcus  addition  experiment  only, 
outlier  arrays  were  identified  as  having  normalized  mean 
positive  control  signal  less  than  25%  of  the  average  across 
the  experimental  series,  and  were  excluded  from  further 
analyses. 

For  all  experiments,  pre-processed  data  were  imported  into 
Excel  for  visualization,  and  the  raw  data  are  available  online 
at  GEO. 

Dafa  deposition 

The  sequences  of  the  environmental  clones  EB000_55B11 
and  EF100_57A08  have  been  deposited  in  GenBank  under 
Accession  Nos.  EU221 238-9.  Microarray  data  are  MIAME 
compliant  and  have  been  deposited  in  GEO  under  platform 
Accession  No.  GPL6012. 
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Supplementary  material 

The  following  supplementary  material  is  available  for  this 
article  online: 

Fig.  SI .  Adjusting  the  stringency,  in  silico. 

a.  The  Tukey  Biweight-normalized  signal  of  all  probes  sets  on 
the  array,  from  one  representative  hybridization  series  (data 
not  comparable  between  series  because  of  different  salt 
concentrations  and  temperatures  used)  of  the  experiment 
described  in  Fig.  4,  whose  mean  normalized  signals  are 
shown  in  Rg.  5.  Note  that  the  other  genotypes  present  by  the 
mean  are  still  present  by  the  more  stringent  Tukey  Biweight. 

b.  Focusing  just  on  the  data  for  the  MED4  probe  set  (red  bar 
in  panel  a).  Using  Tukey  Biweight,  cross-hybridization  to 
related  strains  is  virtually  eliminated. 

Fig.  S2.  The  pattern  of  hybridization  across  the  Prochloro- 
coccus  MED4  probe  set  for  fhe  cell  addition  experiment  fo  a 
natural  seawater  community,  (a)  Approximately  10^  cells  ml"' 
of  each  strain  MED4,  MIT9515,  MIT9312  and  MIT9313; 
(b)  -10®  cells  ml  ’  of  MED4  and  -10^  cells  mP  of  MIT9515, 
MIT9312  and  MIT9313;  (c)  -10®  cells  ml  '  of  MED4;  (d)  -10® 
cells  ml"'  of  MIT9515  and  -10®  cells  ml"’  of  MED4,  MIT9312 
and  MIT9313:  (e)  -10®  cells  ml"'  of  MIT9515;  (f)  -10® 
cells  ml"'  of  MIT9312  and  -10®  cells  ml"'  of  MED4,  MIT9515 
and  MIT9313;  (g)  -10®  cells  mM  of  M1T9312;  and  (h)  -10® 
cells  ml"’  ot  MIT9313. 

Fig.  S3.  Testing  a  Virtual’  strain  with  a  higher  identity  to 
target  strain  MED4,  created  by  using  the  17  probes  (of  60 
total)  with  BLAST-based  identifies  higher  than  90%  to  strain 
MIT9515.  Their  ANI  to  MIT9515  was  92.4%. 

a.  Across  a  range  of  cell  concentrations,  fhe  mean  signal  from 
these  higher -Identity  probes  is  intermediate  to  that  of  the 
whole  probe  set-based  signal  of  MED4  and  MIT9515.  Also, 
see  inset  to  Fig.  4b  for  the  correlation  between  signaf  and 
genomic  identity. 

b.  The  Tukey  Biweight  signal  across  these  probes  Is  also 
intermediate  between  the  whole-set  signals  for  MED4  and 
MIT9515. 

This  material  is  available  as  part  of  the  online  article  from 
http://www.blackwell-synergy.com 
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Chapter  3 


Time-series  investigation  of  a  coastal  microbial  community  in  Monterey  Bay,  CA, 
using  the  “genome  proxy”  microarray. 
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Abstract 


Coastal  marine  microbial  communities  are  dynamic  assemblages,  inhabiting 
spatially  and  temporally  variable  environments.  To  gain  improved  temporal, 
spatial  and  phylogenetic  resolution  of  the  microbial  communities  in  Monterey 
Bay,  we  used  an  expanded  “genome  proxy  array”  (an  oligonucleotide  microarray 
targeting  marine  microbial  genome  fragments  and  genomes)  to  profile  a  total  of 
57  samples  over  4  years.  Samples  derived  from  Om  (photic),  30m  (base  of  the 
surface  mixed  layer),  and  200m  (subphotic)  habitats  were  hybridized  to  the  array, 
along  with  a  single  depth  profile  from  Hawaii  for  comparative  purposes.  The 
updated  array,  which  targeted  268  genotypes  (vs.  14  in  the  prototype),  was 
cross-validated  using  pyrosequence  data  from  three  samples.  The  taxa 
abundances  measured  by  the  two  methods  were  highly  correlated  (linear 
regression  with  R^=0. 85-0.91  for  the  three  samples).  The  strongest  differences 
among  sample  profiles  were  observed  between  the  shallow  (Om  +  30m)  and 
deep  (200m)  samples,  with  a  number  of  depth-specific  taxa  distributions  driving 
these  differences.  Depth-specific  array  profiles  were  also  evident  in  the  Hawaii 
samples,  although  the  photic  zone  taxa  present  were  different  between  the  two 
locations.  Although  Monterey  Bay  is  dominated  by  strong  seasonal  upwelling,  the 
sample  profiles  within  each  depth  did  not  cluster  based  on  sample 
“oceanographic  season”  {sensu  Pennington  et  al.,  2007).  However,  the 
abundance  of  the  most  dominant  genotypes  did  correlate  to  strong  episodic 
upwelling  events.  Genotypes  representing  common  marine  photo-  and 
heterotroph  clades,  the  majority  of  which  are  uncultivated,  were  observed  in  both 
shallow  and  deep  samples,  including  the  ubiquitous  Pelagibacter  dade,  SAR86, 
OM42,  OM43,  NAC11-7,  CHAB1-5,  SAR1 16,  SAR324,  SAR406,  OM60,  ZD0417, 
Arctic96BD-19,  and  the  G1  and  G2  marine  archaea.  Most  showed  strong  depth- 
specific  distributions  consistent  with  their  previously-documented  16S-clone 
library  and  FISH-based  distributions.  Nutrient  concentrations  were  strongly 
correlated  to  overall  array  profile  variance,  driven  by  the  strong  oceanographic 
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differentiation  of  the  three  sample  depths,  and  finer-scale  within-depth  analyses 
linked  several  diverged  array  profiles  to  correlated  nutrient  profiles.  The 
population  structure  of  deeper  taxa  was  more  variable  than  that  of  shallow  taxa, 
and  sporadic  taxa  were  more  variable  than  common  taxa.  Specific  population 
shifts  were  evident  in  several  abundant  target  taxa,  with  populations  in  some 
cases  clustering  by  depth  or  oceanographic  season  and  in  others  apparently 
ecologically  neutral  for  the  sample  designations  examined.  This  multi-year 
community  survey  showed  the  consistent  presence  of  a  core  group  of  common 
and  abundant  targeted  taxa  at  each  depth  in  this  location,  higher  variability 
among  shallow  than  deep  samples,  and  episodic  occurrences  of  other  targeted 
marine  genotypes. 


Introduction 

Marine  microbial  communities  have  garnered  much  attention  in  recent 
years,  as  major  active  participants  in  biogeochemical  cycling  (Arrigo,  2005, 
Howard  et  al.,  2006,  Karl  etal.,  2007),  and  due  to  novel  metabolic  discoveries 
(e.g.  Beja  et  al.,  2000b,  Dalsgaard  et  al.,  2003,  Kuypers  et  al.,  2003,  Kolber  et 
al.,  2000),  and  metagenomic  surveys  beyond  the  scale  of  those  undertaken  in 
other  habitats  (Venter  et  al.,  2004,  Tringe  et  al,  2005,  Delong  et  al.,  2006 
(Appendix  4),  Kennedy  et  al.,  2007,  Rusch  et  al.,  2007,  Yooseph  et  al.,  2007, 
Wegley  et  al,  2007,  Wilhelm  et  al.,  2007,  Dinsdale  et  al.,  2008  a  and  b,  Mou  et 
al.,  2008,  Neufeld  et  al,  2008,  Marhaver  et  al.,  2008).  The  marine  realm  makes 
up  >99%  of  the  available  habitat  on  the  planet,  with  its  inhabitants  comprising  the 
bulk  of  the  planet’s  biomass  and  diversity.  In  spite  of  this  importance  and  growing 
attention,  the  marine  microbial  world  remains  incompletely  understood  due  to  the 
technical  challenges  of  studying  its  vast  diversity  and  habitat  space.  As  with  most 
complex  biological  systems,  marine  microbial  systems  cannot  yet  be  modeled,  in 
that  their  ecological  and  evolutionary  units  and  defining  interactions  are  not 
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known  (although  see  the  promising  nascent  attempts  with  cyanobacteria  of 
Follows  et  a!.,  2007).  The  dynamism  of  these  communities  remains  poorly 
mapped:  the  majority  of  information  derives  from  spatiotemporal  snapshots,  or 
from  studies  focusing  solely  on  a  few  groups,  often  at  higher  phylogenetic 
resolutions  which  may  not  correspond  to  ecologically-relevant  biological  units. 

Interest  in  developing  a  time  series  perspective  on  marine  microbial 
systems  has  been  growing,  however,  and  methods  have  allowed  increasingly 
comprehensive  and  fine-scale  investigations.  Several  marine  Long  Term 
Ecological  Research  (LTER)  sites  have  incorporated  microbial  investigations, 
leading  to  new  insights  into  community  structure  over  time,  correlations  to 
environmental  parameters,  and  responses  to  change  (e.g.  Karner  et  al.,  2001, 
Morris  et  al.,  2005).  In  this  LTER  context,  particularly  noteworthy  marine 
microbial  time-series  investigations  have  occurred  (although  many  at  relatively 
coarse  phylogenetic  resolution),  at  the  Hawai’i  Ocean  Time-Series  (HOT)  (Karner 
et  al.,  2001,  Campbell  et  al.,  1997),  the  Bermuda  Atlantic  Times  Series  (BATS), 
(Steinberg  et  al.,  2001 ,  DuRand  et  al.,  2001 ,  Morris  et  al.,  2005,  McGillicuddy  et 
al.,  2007),  the  San  Pedro  Ocean  Time-Series  (SPOT)  (Fuhrman  etal.,  2006), 
and  the  Monterey  Bay  Microbial  Observatory  (MBMO)  within  the  Monterey  Bay 
National  Marine  Sanctuary  (Ward,  2005,  O’Mullan  and  Ward,  2005,  Mincer  et  al., 
2007). 

A  number  of  methods  exist,  each  with  strengths  and  weaknesses,  for 
tracking  microbial  community  members  (see  Chapter  1).  Community  genomic 
sequencing  may  be  the  optimal  tool  for  exploring  community  composition 
because  of  its  high  information  yield,  but  for  now  remains  financially  unfeasible 
for  sampling-intensive  investigations.  We  previously  described  the  “genome 
proxy"  array  (Rich,  Konstantinidis  and  DeLong,  2008)  which  used  sets  of  70-mer 
probes  to  target  14  genotypes  (genome  fragments  and  genomes).  The  array  was 
designed  to  cross-hybridize  to  related  genotypes  at  >  ~80%  average  nucleotide 
identity  (ANI,  as  in  Konstantinidis  and  Tiedje,  2005),  which  could  be  raised  to  > 
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-90%  ANI  by  tuning  the  analysis  in  silico.  In  addition,  related  cross-hybridizing 
strains  produced  distinct  hybridization  patterns  across  their  target  probe  set, 
which  could  reveal  shifts  in  population  structure  across  samples. 

Here,  we  developed  an  expanded  genome  proxy  array,  and  applied  it  to 
investigate  the  time  series  dynamics  of  the  268  targeted  clades  over  a  four-year 
period  at  Monterey  Bay  Station  M1  (36.747’  N,  122.022’  W),  a  well-studied 
coastal  environment  characterized  by  strong  seasonal  upwelling.  Photic  (Om)  and 
subphotic  (200m)  samples  from  24  time  points  spanning  -4  years,  and  samples 
just  below  the  mixed  layer  (30m)  from  13  time  points  over  -1.5  years,  were 
hybridized  to  the  array.  Array  data  were  cross-validated  by  comparison  to 
pyrosequencing  data  for  three  Om  samples.  The  array-based  organism  profiles 
for  57  samples  were  used  to  investigate:  (i)  genotype  differences  with  depth,  (ii) 
genotype  differences  between  Monterey  Bay’s  “oceanographic  seasons”  (sensu 
Pennington  et  ai,  2007)  (iii)  genotype  differences  associated  with  episodic 
upwelling  events,  (iv)  correlations  between  hybridization  profiles  and  nutrient 
concentrations  (nitrate,  nitrite,  phosphate,  silicate),  (v)  and  correlations  in  the 
distribution  of  genotypes  to  one  another. 


Methods 

Sampling  and  DNA  Extractions 

Samples  were  collected  from  Station  Ml  (36.747’  N,  122.022’  W)  in 
Monterey  Bay  periodically  (at  approximately  monthly  intervals,  with  several 
longer  gaps)  between  Julian  Day  (JD)  271  in  2000  and  JD167  in  2004.  2L  of 
seawater  from  each  of  eight  depths  (0,  20,  30,  40,  80,  100,  150  and  200m)  were 
filtered  through  a  45mm  GF-A  prefilter  (Whatman)  and  concentrated  onto  a 
25mm  Supor-200  0.2pm  filter  (Pall  Corp,  Ann  Arbor,  Ml),  using  a  MasterFlex 
peristaltic  pump  system  (Cole-Parmer  Instrument  Company,  Vernon  Hills,  IL). 
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Filters  were  stored  dry  in  2ml  screw-cap  tubes,  immediately  placed  in  a  -20 
degree  Celsius  freezer  shipboard,  and  transferred  on  ice  to  a  -80  degree  Celsius 
freezer  upon  landfall. 

All  MB  DNA  extractions  were  performed  simultaneously  in  96-well  format  to 
minimize  extraction  variability,  as  in  Rich,  Konstantinidis  and  Delong,  2008.  DNA 
was  extracted  from  all  Om  and  200m  filters  available  from  2000  JD271  through 
2004  JD167,  and  all  30m  samples  available  from  2000  JD271  through  2002 
JD070.  In  this  location,  Om  is  in  the  photic  zone,  30m  is  generally  below  the 
mixed  layer,  and  200m  is  below  the  photic  zone.  Extracted  DMAs  were  quantified 
spectrophotometrically  (Nanodrop,  Thermo  Scientific)  and  stored  at  -80  degree 
Celsius  until  use.  Yields  averaged  -470  ng  per  liter  of  seawater  for  200m 
samples  (range  1 77-903  ng)  and  -1460  ng  per  liter  of  seawater  for  Om  and  30m 
samples  (range  484-3804  ng). 

In  addition  to  Monterey  Bay  samples,  several  community  DNAs  from  the 
Hawaii  Ocean  Time  series  Station  ALOHA  were  hybridized  to  the  array.  These 
samples  were  collected  on  cruise  HOT179  in  March  of  2006  as  described  in 
Frias-Lopez  and  Shi  et  al.  (2008),  and  include  the  75m  DNA  sample  used  in  that 
study.  DNA  was  extracted  as  described  in  Frias-Lopez  and  Shi  et  al.  (2008). 

Oceanographic  Data 

Oceanographic  data  were  kindly  provided  by  Reiko  Michisaki  and  Francisco 
Chavez  of  the  Biological  Oceanography  Group  at  the  Monterey  Bay  Aquarium 
Research  Institute,  who  collected  and  processed  it  as  part  of  the  Monterey  Bay 
time  series  program.  Measurement  methods  were  described  in  Asanuma  etal., 
1999. 

Arrays  Design,  Hybridization,  and  Data  Processing 

The  expanded  genome  proxy  array  was  designed  as  in  Rich,  Konstantinidis 
and  DeLong,  2008,  with  a  broader  scope  (268  target  genotypes,  as  opposed  to 
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the  prototype’s  14)  and  the  addition  of  a  co-spot  oligo  for  spot  alignment  and 
gridding  purposes  (the  “alien”  sequence  used  in  Urisman  etal.,  2005).  The 
targets  were  selected  from  fully-sequenced  marine  microbial  genomes,  publicly- 
available  marine-derived  BAG  and  fosmid  clone  sequences,  and  fully-sequenced 
clones  from  the  lab’s  Monterey  Bay  and  Hawai’i  environmental  BAG-  and  fosmid- 
based  genomic  libraries.  Targeted  genotypes  are  detailed  in  Table  1, 
summarized  in  Table  2,  and  presented  in  a  schematic  phylogenetic  overview  in 
Figure  1 .  Previously-unpublished  sequences  used  for  array  design  were 
submitted  to  Genbank  under  accession  numbers  XXX-XXX. 

For  each  sample,  at  least  three  replicate  arrays  were  hybridized.  For 
samples  in  which  one  or  more  of  the  arrays  showed  significant  surface  peeling  or 
excessive  background  fluorescence,  additional  arrays  were  hybridized. 
Hybridizations  were  performed  as  in  Rich,  Konstantinidis  and  DeLong,  2008,  with 
the  following  modifications:  Round  A,  B  and  G  reactions  were  performed  in  96 
well  plates  for  higher  throughput,  and  cleaned  through  ExcelaPure  96-well  plates 
(Edge  Biosystems,  Gaithersburg).  1  pmol  of  Gy5-labeled  co-spot  complement 
oligo  was  added  to  each  hybridization  for  spot  localization  purposes  (modified 
from  Urisman  et  al.,  2005). 

Data  were  pre-processed  as  in  Rich,  Konstantinidis  and  DeLong,  2008,  with 
minor  modifications.  Briefly,  poorly-performing  arrays,  defined  as  those  with  less 
than  half  the  positive  control  probes  brighter  than  the  standard  deviation  of  the 
negative  control  probes,  were  removed  from  further  analysis.  Within  each 
remaining  array,  bad  spots  (those  with  areas  of  poly-L-lysine  peeling  or 
excessive  background  fluorescence)  were  manually  flagged  and  removed  from 
further  analysis.  Background-subtracted  spot  intensities  were  negative-control- 
subtracted  and  normalized  to  each  array’s  mean  positive  control  value,  then 
replicate  spots  of  a  given  probe  were  pooled  across  arrays  and  the  median  was 
taken  as  the  value  for  that  probe.  For  each  organism,  the  mean  or  tukey  biweight 
(TBW)  across  each  probe  set  was  taken,  as  in  Rich,  Konstantinidis  and  DeLong, 
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2008,  with  an  improvement  in  the  subsequent  thresholding  step  for  each 
organism,  as  follows.  At  least  40%  of  each  organism’s  probes  were  required  to 
be  above  the  standard  deviation  of  the  negative  control  probe  set  (rather  than 
above  twice  the  mean  negative  control  value,  as  previously),  or  else  the 
organism  was  considered  “absent”  and  its  value  set  to  zero.  This  was  done  to 
remove  erroneous  organism  abundances  due  to  uninformative  single-gene 
cross-hybridizations. 

Array  platform  design  and  hybridization  data  were  deposited  in  the  Gene 
Expression  Omnibus,  under  GEO  Accession  numbers  XXX  and  XXX-XXX, 
respectively. 

Data  Analyses 

Clustering  analyses  of  sample  hybridization  data  were  performed  in 
GenePattern  (Reich  etai,  2006),  using  hierarchical  clustering  (Eisen  et  al.,  1998) 
by  Pearson  correlations  for  both  rows  and  columns,  using  pairwise  complete- 
linkage,  and  without  row  or  column  centering.  Marker  Prediction  was  performed 
in  GenePattern.  Principal  component  analyses  (PCA)  was  performed  in  both 
GenePattern  and  in  R  using  the  prcomp  function.  Canonical  discriminant 
analyses  (CDA)  were  performed  in  R  with  the  candisc  function.  In  order  to  keep 
the  number  of  variables  less  than  the  number  of  responses  (i.e.,  samples),  CDA 
was  performed  using  the  top  28  principal  components  instead  of  all  detected 
organisms.  Correlations  were  calculated  between  environmental  parameters  or 
organism  abundances  and  each  plotted  principal  component  or 
canonical  discriminant  axis.  The  relative  values  of  the  correlations  were 
represented  as  vectors  on  the  analysis  graphs. 

Arrav-vs-pvroseauencing  Comparisons 

Three  samples  were  chosen  for  parallel  pyrosequencing  and  array 
hybridization,  based  on  their  DNA  yield.  Approximately  3pg  each  of  samples 
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2000  JD298,  2001  JD115  and  2001  JD135  were  sequenced  at  the  Schuster  Lab 
pyrosequencing  facility  (Penn  State  University)  on  a  454  sequencer. 

Sequence  Clean-Up:  To  remove  poor  quality  sequences,  the  length 
distribution  of  the  raw  pyrosequencing  reads  for  each  sample  was  plotted.  From 
the  empirical  cumulative  density  function  (ecdf)  plot,  the  lower  and  upper 
boundary  lengths  were  estimated  so  that  95%  of  the  read  lengths  fell  between 
the  boundaries  (which  varied  for  each  sample:  71  and  305bp  for  2000JD298,  65 
and  255bp  for  2001JD1 15,  and  65  and  303  bp  for  2001JD135).  The  outlying  5% 
of  the  reads  were  removed.  Furthermore,  reads  with  more  than  one  “N”  were 
also  removed.  This  two-step  process  removed  approximately  5.5%  of  the  reads 
overall;  for  2000JD298,  23917  out  of  419684  reads  (5.7%)  were  discarded,  for 
2001JD1 15,  19822  out  of  365472  reads  (5.4%)  were  discarded,  and  for 
2001JD135,  22887  out  of  414861  reads  (5.5%)  were  discarded. 

BLASTN  parameters:  To  identify  BLASTN  parameters  that  would  give  the 
closest  in  silico  similarity  to  the  array’s  range  of  cross-hybridization,  we  used  the 
genomes  of  Prochlorococcus  MED4,  MIT9515,  and  MIT9312,  whose  relative 
hybridization  strength  to  the  array’s  strain  MED4  probes  was  measured 
previously  (Rich,  Konstantinidis  and  DeLong,  2008).  The  genomes  were 
fragmented  into  overlapping  (tiled)  100-bp  fragments  using  a  perl  script  (kindly 
provided  by  G.  Tyson),  and  each  set  of  fragments  was  BLASTed  against  the 
MED4  genome  to  compare,  for  varying  parameters,  the  self-self  results  (MED4  to 
MED4,  100%  identity),  MIT9515  to  MED4  (86%  average  genomic  identity, 
calculated  as  in  Konstantinidis  and  Tiedje,  2005),  and  MIT9312  to  MED4  (78.5% 
average  genomic  identity).  The  following  combinations  of  command-line  BLASTN 
parameters  were  tested:  1)X150  q-1  r1  W7  FF,  2)X30  q-3  r1  W7  FF,  3)X30  q-5  r1 
W7  FF,  4)X30  q-5  r2  W7  FF,  and  5)X30q-7r2W7FF,  among  which  the  first 
parameter  set  yielded  the  best  separation  of  MED4-MIT9515  and  MED4- 
MIT9312  distribution  of  hits,  and  was  subsequently  used  in  downstream 
analyses. 
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Parsing  parameters:  BLASTN  hits  to  a  given  target  were  parsed  by  bit 
score.  However,  because  pyrosequencing  reads  range  in  lengths,  and  read 
length  effects  bit  score,  we  investigated  the  correlation  between  read  length  and 
bit  score  for  MIT9515  fragments  versus  MED4  and  for  MIT9312  fragments 
versus  MED4.  In  addition  to  tiled  100-bp  fragments,  tiled  50-bp,  75-bp,  and  125- 
bp  fragments  were  also  generated.  Linear  equations  for  bit-score  (y-axis)  versus 
read  length  (x-axis)  were  determined.  The  MED4-MIT9312  slope  was  smaller 
than  that  of  MED4-MIT9515,  due  to  the  lower  average  identity  involved  at  any 
given  read  length.  Since  cross-hybridization  at  or  above  the  MIT9515-MED4  level 
of  identity  dominates  the  signal  of  the  microarray  (Rich,  Konstantinidis  and 
DeLong,  2008),  the  equation  for  that  comparison  was  used  to  adjust  the  bit  score 
cutoff  to  the  read  length  for  each  individual  read. 

Monterey  Bay  pyrosequencing  versus  array  comparison:  Using  the 
BLASTN  parameters  and  parsing  criteria  optimized  above,  the  pyrosequencing 
reads  from  each  sample  were  BLASTed  against  all  268  genomes  and  genome 
fragments  to  which  the  array  was  targeted.  Reads  were  assigned  to  (a.k.a. 
recruited  to)  one  or  more  array  targets,  proportional  to  their  bitscore,  to  mimic  the 
cross-hybridization  permitted  by  the  array.  Thus,  if  1  read  matched  three  targets 
using  the  criteria  outlined  above,  then  it  would  be  assigned  to  the  first  of  those 
targets  as  1  *  (bitscorel  /  (bitscorel  +  bitscore2  +  bitscore3)),  to  the  second  as  1* 
(bitscore2  /  (bitscorel  +  bitscore2  +  bitscore3)),  etc..  The  read-based  abundance 
of  each  array  target  was  then  normalized  to  the  length  of  the  target  query,  and  to 
the  database  size,  and  compared  to  the  unthresholded  array  signal  (that  is,  the 
signal  for  each  organism  before  requiring  at  least  40%  of  its  probes  to  be  above 
the  described  threshold)  of  the  same  clone. 


Results 

Development  of  the  Expanded  Genome  Proxy  Array 
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The  expanded  genome  proxy  array  targeted  268  organisms,  through  suites 
of  probes  (n=20  per  target,  in  general)  dispersed  along  genomes  and  genome 
fragments  derived  from  marine  habitats.  Targeted  organisms  were  selected  to 
span  known  marine  microbial  diversity  (Figure  1,  Tables  1  and  2,  and  Figures 
S1-S5).  For  particularly  diverse  and  abundant  clades  (e.g.  the  marine 
cyanobacteria  Prochlorococcus  and  Synechococcus),  representatives  were 
chosen  where  possible  from  each  major  pelagic  coastal  and  open-ocean  lineage. 
Of  the  268  organisms  represented  on  the  array,  42.5%  were  clones  derived  from 
HOT,  26.5%  were  marine  microbial  genomes  isolated  from  a  variety  of  locations, 
19%  were  clones  derived  from  Monterey  Bay,  and  1 1.9%  were  other  marine- 
derived  clones  available  in  Genbank  (Figure  S6a). 

Ground-truthing  the  array 

To  rigorously  evaluate  our  new  expanded  genome  proxy  array,  we  sought 
to  compare  pyrosequencing  and  array  data  for  each  of  three  Monterey  Bay 
samples  (Om  from  2000  JD298,  2001  JD1 15  and  2001  JD135).  Sequencing 
produced  an  average  of  400,000  reads  per  sample,  which  were  trimmed  to 
remove  poor  quality  sequence  (-5.5%  of  reads),  then  “hybridized”  in  silico  using 
BLAST  (Altschul,  1990)  to  genotypes  targeted  by  the  array.  BLAST  parameters 
were  trained  using  genomes  of  Prochlorococcus  strains  whose  relative  cross¬ 
hybridization  to  the  array  had  been  previously  investigated  (Rich,  Konstantinidis 
and  DeLong,  2008),  in  order  to  simulate  the  amount  of  target  divergence 
tolerated  by  the  array.  The  sampling  depth  of  the  pyrosequencing  data  was 
insufficiently  deep  to  meaningfully  examine  the  evenness  of  BLAST  hits  to  each 
target  (that  is,  their  distribution  across  the  target  sequence),  whereas  such 
filtering  is  performed  during  array  data  analysis  (by  requiring  >40%  of  a  target 
probe  set  to  show  above-threshold  signal  to  consider  that  target  “present”). 
Therefore,  unfiltered  array  data  were  compared  to  pyrosequencing  data  for  each 
sample. 
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The  normalized  pyrosequencing  read  recruitment  was  strongly  correlated  to 
the  normalized  unfiltered  mean  array  intensity  (Figure  2;  linear  regression  with 
of  0.91  for  sample  2000  JD298,  0.88  for  2001  JD1 1 5,  and  0.85  for  2001  JD1 35). 
Such  strong  correlation  between  relatively  unbiased  and  comprehensive 
pyrosequencing,  and  the  high-throughput,  inexpensive  genome  proxy  array, 
supports  the  array’s  utility  as  a  tool  for  profiling  studies  requiring  high  sample 
throughput. 

Exploring  microbial  communities  using  the  genome  proxy  array 

Target  derivation  versus  presence:  57  samples  from  Monterey  Bay  (location 
and  relative  times  and  depths  of  samples  indicated  in  Figure  3b),  and  4  samples 
from  Hawaii  were  hybridized  to  the  array.  Targeted  coastal  genotypes  were 
enriched  in  Monterey  Bay  samples,  and  targeted  open  ocean  genotypes  were 
enriched  in  Hawaii  samples.  That  is,  74.1%  of  all  target  genotype  signal  in  57  MB 
samples  were  from  MB-derived  clones  (Figure  S6b),  a  ~2.5  fold  enrichment 
relative  to  their  representation  on  the  array.  Alternately,  59.3%  of  all  target  signal 
in  4  Hawaii  samples  were  from  Hawaii-derived  targets,  1.39-fold  their 
representation  on  the  array  (Figure  S6c). 

Shallow  versus  deep  genotypes:  Hierarchical  clustering  was  used  to 
investigate  community  depth  partitioning.  By  Pearson  correlation-based 
clustering  of  the  Monterey  Bay  samples,  all  200m  (sub-photic  zone)  samples 
clustered  together  to  the  exclusion  of  the  shallower  samples  (Om  photic  zone, 
and  30m  below  the  mixed  layer  sample;  Figure  4a).  Likewise,  among  four  Hawaii 
samples,  hierarchical  clustering  followed  depth  (Figure  4b). 

Principal  component  analysis  (PCA)  of  the  Monterey  Bay  hybridization 
profiles  also  supported  a  clear  separation  of  shallow  and  deep  samples  (Figure 
5),  with  a  slight  additional  separation  of  the  Om  and  30m  samples.  The  first  two 
principal  components  account  for  >90%  of  the  data’s  variability,  and  clearly 
delineate  the  shallow  and  deep  clusters,  recapitulating  previously-observed 
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microbial  community  stratification  along  the  depth  gradient. 

The  majority  of  targeted  taxa  showed  differential  distributions  between 
shallow  and  deep  samples,  in  both  Monterey  Bay  (0  and  30m  versus  200m; 
Figure  4a)  and  Hawaii  (25,  75  and  125m  versus  500m;  Figure  4b)  samples.  In 
Hawaii,  500m  taxa  never  occurred  in  the  shallow  sample,  and  vice  versa.  In  the 
much  more  extensive  Monterey  Bay  dataset,  there  were  three  notable  target 
clusters  with  particularly  strong  depth-specific  signals  (red-dashed  boxes  in 
Figure  4a).  The  first  cluster  comprised  8  target  genotypes  that  were  abundant 
and  consistently  present  in  shallow  samples,  and  spanned  a  range  of 
phylogenetic  clades.  This  cluster  is  hereafter  referred  to  as  “shallow-consistenf , 
and  included  EB000_31A08,  EB000_45B06,  alpha_HTCC2255,  EB000_39F01, 
EB000_55B11,  EB080_L11F12,  EB080_L43F08,  and  EB080_L27A02.  A  second 
cluster  of  shallow  genotypes,  “shallow-frequenf,  encompassed  12  frequently- 
occurring  targets;  EB000_37F11,  EB080_L06A09,  EB000_36A07, 

EB000_46D07,  EB000_69G07,  EB000_39H12,  EB000_49D07,  EBAC_27G05, 
EB000_50A10,  HF0010_16H03,  Fe/ag/bac/er  HTCC1002  and  HTCC1062.  The 
third  cluster,  “deep-consistenf’ ,  represented  10  taxa  with  a  consistent  presence 
and  abundance  in  the  200m  samples:  EB000_36F02,  DeepAnt_EC39, 
EB750_10B11,  EB750_10A10,  HF4000_23L14,  EB080_L31E9,  EB080_L93H08, 
EF100_57A08,  EB750_01B07,  and  HF4000_08N17. 

Canonical  discriminant  analysis  was  used  to  further  examine  genotype 
distributions  with  depth.  Each  genotype  abundance  was  correlated  to  the  first  two 
canonical  discriminant  axes,  with  the  resulting  vector  length  a  measure  of  that 
genotype’s  influence  on  sample  variability  (Figure  6a).  By  this  analysis,  the 
targets  which  most  drove  the  separation  of  the  deep  from  the  shallow  samples 
were  EB750_01B07,  EB750_10B11  EB080_L31  E09,  and  HF4000_08N17,  a 
subset  of  the  deep-specific  organisms  discerned  in  the  above  clustering  analysis. 
For  Om  and  30m,  the  picture  was  more  complex,  and  included  taxa  not  identified 
as  dominant  signals  in  the  clustering  analysis.  EB080_L43F08,  EB000_39F01 , 
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ProMED4,  EB080_L27A02,  and  alpha_HTCC2255  drove  the  differentiation  of  Om 
from  30m.  while  EB000_39H12,  EBAC_27G05  and  EB000_65A11  drove 
differentiation  of  30m  from  Om. 

Environmental  Parameters:  We  investigated  the  correlation  between 
clustering  patterns  observed  using  the  array  and  environmental  parameters,  in 
two  ways.  First,  each  sample  was  assigned  to  its  “oceanographic  season”,  a 
designation  based  on  average  annual  upwelling  patterns  in  Monterey  Bay 
(spring/summer,  fall,  or  winter,  described  in  e.g.  Pennington  et  al.,  2007)  and 
these  designations  were  compared  to  the  samples’  clustering  patterns  (Figure 
4a). 

Second,  canonical  discriminant  analysis  was  used  to  examine  the 
correlation  between  individual  nutrient  (phosphate,  nitrate,  nitrite  and  silicate) 
concentrations  and  sample  variability  (Figure  6b).  Here,  strong  correlations  were 
apparent  to  each  nutrient,  reflecting  large  differences  in  nutrient  conditions  at  the 
three  depths.  Phosphate,  nitrate  and  silicate  drove  the  differentiation  of  the 
shallow  from  the  deep  samples,  while  nitrite  drove  the  separation  of  30m  from 
Om. 


Since  possible  correlations  at  each  depth  were  obscured  by  the  strength  of 
the  nutrient  signals  between  depths,  samples  from  each  depth  were  also  plotted 
in  separate  principal  component  analyses,  and  the  correlations  of  each  nutrient’s 
variability  to  the  first  two  principal  component  axes  were  calculated  (Figure  7). 
(Principal  component  analysis  was  used  for  this  instead  of  canonical  discriminant 
analysis  because  whereas  with  c.d.a.  the  distance  between  all  defined  groups  is 
maximized,  in  p.c.a.  the  total  variability  among  all  samples  is  maximized,  and  we 
chose  not  to  define  subgroups  within  each  depth.)  Variation  in  nutrient 
concentrations  among  samples  accounted  for  little  of  the  variability  among  Om 
samples  (Figure  7a),  with  a  minor  correlation  of  nitrite.  At  30m  (Figure  7b), 
however,  nutrient  variability  correlated  relatively  strongly  to  the  principal 
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component  axes,  with  a  strong  signal  of  phosphate,  nitrate  and  silicate  and  a 
slightly  weaker  and  inverse  signal  for  nitrite.  Finally,  at  200m  (Figure  7c),  nitrate 
and  nitrite  showed  no  and  weak  correlations,  respectively,  while  silicate  and 
phosphate  gave  equally  strong  but  non-overlapping  correlations. 

Population  variations:  Population  shifts  over  time  were  examined  in  two 
ways.  First,  each  target’s  mean  intensity  was  compared  to  its  tukey  biweight 
intensity  within  each  sample.  There  was  a  larger  drop  of  TBW  relative  to  mean 
for  sporadically  distributed  taxa  compared  to  depth-consistent  taxa,  and  also  for 
common  deep  taxa  compared  to  common  shallow  taxa  (Figure  8).  Second,  for 
particular  targets  of  interest,  the  pattern  of  signal  across  the  probe  set  was 
compared  between  samples,  and  the  pair-wise  Pearson  correlation  of  these 
patterns  was  calculated.  Clustering  analysis  of  the  Pearson  correlations  between 
samples  was  then  used  to  reveal  samples  with  more  and  less  similar  probeset 
patterns  for  a  given  genotype.  For  the  SAR86-II  target  EB000_45B06,  this 
process  is  shown  in  Figure  9. 


Discussion 

Over  the  ~4-year  sampling  period  at  Station  M1  in  Monterey  Bay,  a 
significant  portion  of  the  expanded  genome  proxy  array's  targets  showed  signal 
(95  out  of  268  targets,  -35%,  were  present  in  one  or  more  samples).  The 
majority  of  targets  detected  by  array  were  uncultivated  marine  lineages,  many  of 
which  derived  from  the  environment  of  study  (Figure  S6).  Broadly,  there  were 
three  major  patterns  of  target  occurrence  across  the  57  samples  hybridized. 
Some  taxa  were  consistently  abundant  in  most  or  all  samples  of  a  given  depth, 
other  taxa  were  frequently  present  within  their  primary  depth  of  occurrence,  and 
many  taxa  had  sporadic  distributions  in  one  or  more  depths. 

The  genome  proxy  array  platform  was  previously  validated  using  related 
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target  strains  added  into  natural  marine  community  samples  at  a  range  of 
concentrations  (Rich,  Konstantinidis  &  DeLong  2008).  In  this  study  we  further 
validated  the  results  of  the  expanded  platform  by  comparing  its  data  to 
community  genomic  pyrosequencing  data,  for  three  surface  samples.  This 
represented  a  full  methodological  comparison,  encompassing  the  array’s 
potential  biases  in  both  the  amplification  and  labeling  steps  and  the  hybridization 
itself;  pyrosequenced  DNA  was  not  subjected  to  the  same  amplification-and- 
labeling  protocol  as  the  aliquots  used  for  array  hybridization.  Overall  there  was 
strong  correlation  between  taxa  abundance  measured  by  mean  array  intensity 
and  by  BLAST-based  recruitment  of  pyrosequences  to  targeted  genomes  and 
genome  fragments,  with  linear  regression  values  of  0.85-0.91  for  the  three 
samples. 

In  addition,  the  pyrosequence  data  indicated  what  percentage  of  the 
community  could  be  surveyed  by  the  array,  i.e.  what  percent  of  the  community 
was  represented  by  the  targets  on  the  array.  Based  on  the  number  of 
pyrosequence  reads  recruited  to  the  array  target  sequences  at  the  relatively  high 
stringency  used  to  mimic  the  array  hybridization,  the  array  captured  1.9%-2.5% 
of  the  total  reads  in  these  three  samples  (7636/395767  for  0m_2000_298, 
8743/345650  for  0m_2001_115,  and  9252/39197  for  0m_2001_135).  A  recent 
analysis  of  a  similarly-obtained  marine  pyrosequence  dataset  showed  only  50% 
of  reads  had  identity  to  any  Genbank  sequences  (Frias-Lopez  and  Shi  et  al., 
2008),  using  less  stringent  criteria.  Furthermore,  the  ten  targets  with  the  highest 
number  of  recruited  reads  in  each  sample  accounted  for  from  -66%  to  75%  of 
the  total  reads.  In  all  three  cases,  9  of  the  top  10  targets  were  environmental 
genomic  clones,  with  the  tenth  being  a  recently-sequenced  genome  from  the 
NAC1 1-7  clade  of  the  Roseobacteria.  Together  with  the  relative  decrease  in 
marine  genome  observations  versus  presence  on  the  array  in  both  Monterey  Bay 
and  Hawaii  samples,  these  suggest  that  “native”,  uncultivated  DNA  sequences 
are  most  effective  for  investigating  marine  microbial  communities,  and  that  by 
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being  designed  from  such  sequences,  the  array  can  provide  a  useful 
complement  to  other  means  of  community  investigation. 

After  cross-validating  the  expanded  genome  proxy  array  with  pyrosequence 
data,  we  investigated  the  depth-specific  distributions  of  targets  from  particular 
phylogenetic  groups,  across  the  Monterey  Bay  samples,  focusing  on  taxa  that 
occurred  in  multiple  samples.  One  of  the  most  highly  represented  groups  was 
Roseobacter,  which  are  known  to  comprise  up  to  20%  of  cells  in  coastal  samples 
(reviewed  in  Buchan  etal.,  2005),  are  ecologically  diverse,  and  include  both 
cultivated  and  uncultivated  lineages.  Roseobacteria  have  been  described 
previously  as  abundant  in  Monterey  Bay,  accounting  for  20-40%  of  total  bacterial 
SSL)  DNA  in  the  mid-bay  region  during  an  upwelling  event  (Suzuki  et  al.,  2001 ). 

In  large-insert  genomic  libraries  from  this  site,  the  NAC1 1-7  and  CHAB-l-5  clades 
accounted  for  -22%  and  -6%,  respectively,  of  the  SSL)  operon-containing  clones 
of  both  the  Om  and  80m  libraries,  representing  -65%  of  the  total  Roseobacter 
signal  in  each  (Suzuki  et  al.,  2004).  The  array  abundance  of  Roseobacter  targets 
agrees  with  previous  estimations  of  their  abundance  (Figures  4a  and  S7a).  A 
significant  number  (3  of  8)  of  taxa  in  the  shallow-consistent  cluster  were  NAC1 1  - 
7  clones  (EB080_L1 1 F12,  EB080_L43F08,  EB080_L27A02)  as  were  2  of  12 
shallow-frequent  targets  (EB080_L06A09  and  the  NAC1 1-7  genome 
Rhodobacterales  HTCC2255).  Overall,  NAC11-7  represented  25%  of  the 
targeted  taxa  that  commonly  occurred  (frequenter  consistent  clusters)  in  shallow 
samples.  Lastly,  1  of  10  deep-consistent  taxa  was  a  CHAB-l-5  clone 
(EB000_36F02).  In  addition  to  their  high  surface  abundances  generally,  the 
differential  distributions  of  three  of  the  Roseobacter  NAC1 1-7  targets 
(EB080_L27A02,  EB080_L43F08,  and  HTCC2255)  between  Om  and  30m 
samples  helped  drive  the  differentiation  of  these  samples  (Figure  6a). 

Members  of  the  uncultivated  gammaproteobacterial  SAR86  clade  were  also 
abundant  in  shallow  samples.  SAR86  has  been  commonly  reported  in  marine 
samples  (Eilers  et  al.,  2000,  Rappe  et  al.,  2000,  Suzuki  et  al.,  2001 ,  Venter  et  al.. 
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2004,  Morris  etal.,  2006),  is  known  to  partition  with  depth  (Morris  eta/.,  2006), 
and  can  comprise  up  to  10%  of  the  cells  in  a  community  (Mullins  etal.,  1995, 
Eliers  et  ai,  2000,  Morris  et  al.,  2006).  Furthermore,  it  has  been  previously 
described  as  abundant  in  Monterey  Bay,  as  3-6%  of  total  bacterial  SSU  DMAs  in 
the  Bay  during  an  upwelling  event  (Suzuki  etal.,  2001),  and  as  5.6%,  5.5%,  and 
1.6%  of  the  SSU  operon-containing  clones  in  Om,  80m  and  100m  large-insert 
clone  libraries  from  this  location  (Suzuki  et  al.,  2004).  Array-based  sample 
profiling  recapitulated  this  importance  (Figures  4a  and  S7b),  as  2  of  8  shallow- 
consistent  taxa  were  SAR86-II  clones  (EB000_31A08  and  EB000_45B06),  and  a 
SAR86-III  clone  (EBAC_27G05)  was  among  the  12  frequent-shallow \axa.  All 
three  clones  possess  proteorhodopsin  (PR)  genes,  and  PR-containing  SAR86 
types  have  been  hypothesized  to  be  photoheterotrophs  (Beja  et  al.,  2000,  Sabehi 
etal.,  2004,  Sabehi  et  al.,  2005,  Mou  etal.,  2007,  Sabehi  etal.,  2007).  The 
distribution  of  the  SAR86-III  clone  also  helped  drive  the  differentiation  of  30m 
samples  from  those  at  Om  (Figure  6a). 

The  alphaproteobacterial  SAR1 1  clade  is  one  of  the  most  abundant  in  the 
world’s  oceans  (e.g.  Morris  et  al.,  2002)  and  was  isolated  from  coastal  waters 
approximately  700  miles  north  of  the  study  area  (Rappe  et  al.,  2002).  Seven  of 
the  10  targeted  SAR11  genotypes  were  present  in  >  1  Monterey  Bay  sample, 
and  each  showed  depth-specific  distribution  (Figures  4a  and  S7c).  Pelagibacter 
HTCC1062  and  HTCC1002,  cultivated  strains  both  in  the  SAR11  subgroup  la, 
were  present  only  in  shallow  samples  and  occurred  frequently  but  not 
consistently.  Several  other  SAR11  genotypes  were  present  only  in  deep 
samples,  and  occurred  frequently  or  sporadically  (HF4000_37C10, 
HF4000(384)_009C18,  HF0770_37D02,  EBAC750_11E01,  and  EB750_09G06). 
This  is  consistent  with  the  known  depth  distributions  of  the  two  major  SAR1 1 
clades  (e.g.  Stingl  etal.,  2007).  Furthermore,  the  distribution  of  HTCC1062  and 
HTCC1002  showed  no  correlation  to  upwelling  season,  consistent  with  previous 
observations  that  their  numbers  do  not  change  under  phytoplankton  bloom 
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conditions  (Morris  etal.,  2005). 

Proteorhodopsin-  (PR)-  containing  targets  produced  strong  array  signals 
throughout  the  shallow  samples.  In  addition  to  the  SAR86  clones,  a  number  of 
PR-containing  targets  without  phylogenetic  markers  were  among  the  shallow- 
consistent  (3  clones)  and  shallow-frequent  clusters  (4  clones).  These  targets 
were  designated  as  various  Proteobacteria  based  on  BLAST-based  identities.  In 
total,  targets  known  to  carry  the  proteorhodopsin  gene  accounted  for  50%  of  the 
taxa  abundant  in  shallow  samples  (5  of  12  shallow-frequent  and  6  of  10  shallow- 
consistent  taxa).  Two  of  these  PR-containing  clones  had  sufficiently  inverted 
relative  abundances  at  Om  and  30m  to  contribute  to  the  differentiation  of  the  two 
depths  (Figure  6a;  EB000_39F01  in  Om,  and  EB000_39H12  in  30m).  These 
observations  are  in  agreement  with  the  increasing  awareness  of  high 
proteorhodopsin  gene  abundances  in  photic  zones  (Beja  et  a!.,  2000,  Sabehi  et 
al.,  2004,  McCarren  et  al.,  2007,  Rusch  et  al.,  2007)  and  of  the  emerging 
suggestions  of  PR-based  photoheterotrophs  as  abundant  components  of  photic 
communities  (Sabehi  et  al.,  2005,  Stingl  et  al.,  2007,  Gomez-Consarnau  et  al., 
2007,  Moran  and  Miller,  2007,  Gonzalez  et  al.,  2008). 

One  of  the  other  shallow-frequent  taxa  was  a  representative  of  the  OM43 
clade  (target  EB000_36A07),  which  has  been  observed  to  respond  to  diatom 
blooms  (Morris  et  al.,  2006).  These  blooms  occur  in  MB  during  the  first  upwelling 
season  (e.g.  Pennington  et  al.,  2007).  In  our  MB  samples,  a  general  correlation 
between  this  clone’s  occurrence  and  upwelling  season  was  not  observed. 
However,  during  specific  post-bloom  samples  with  particularly  high  array 
intensities  (see  below),  this  OM43  target  was  among  the  small  number  of  targets 
with  the  most  dramatic  increases  in  intensity. 

The  final  bacterial  target  within  the  shallow-frequent  cluster  was  a  SAR1 16-1 
clone  (EB000_46D07).  Of  12  SAR1 16  targets,  two  originated  in  Monterey  Bay, 
and  were  the  only  ones  detected.  The  SAR1 16-11  target  (EB00_37G09)  was 
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present  only  twice,  in  Om  samples,  while  the  SAR1 16-1  clone  was  present  in  21 
of  34  shallow  samples.  In  large-insert  environmental  libraries  from  this  site, 

SAR1 16  comprised  11.3%,  1.4%,  and  0.8%  of  the  SSU  operon-containing  clones 
in  Om,  80m,  and  100m  libraries,  respectively  (Suzuki  et  a!.,  2004).  This 
Rhodospirilalles  clade  has  broad  global  distribution  and  frequently  high 
abundances  (e.g.  Giovannoni  and  Rappe,  2000,  Delong  etal.,  2006,  Rusch  et 
al.,  2007),  but  has  only  recently  been  isolated  (Stingl  et  at.,  2007).  Due  to  the 
phylogenetic  diversity  of  this  clade  (at  least  10%  divergent  16S  rRNA,  Stingl  et 
al.,  2007),  it  may  be  that  the  relative  specificity  of  the  array  platform  prohibited  it 
from  tracking  other  native  SAR1 16  strains;  that  is,  that  the  other  SAR1 16  targets 
on  the  array  did  not  share  sufficient  identity  (i.e.  ~<80%  ANI)  with  local 
populations  to  produce  array  signal.  An  alternative  explanation  is  that  the 
previously-constructed  Om  large-insert  library  captured  an  unusual  bloom  of 
SAR1 16  at  this  location,  and  that  they  are  not  normally  present  at  -10%  of 
surface  populations.  However,  a  bloom  scenario  seems  unlikely,  because  the 
captured  SAR1 16  rRNA  genes  in  the  three  libraries  spanned  the  breadth  of 
SAR1 16  diversity.  The  array  results  suggest  that  additional  sequencing  of 
previously-captured  SAR1 16  clones  from  this  location  may  be  appropriate,  to 
further  and  best  represent  native  populations.  To  identify  which  SAR1 16  clone(s) 
would  be  optimal,  the  surface  pyrosequence  databases  can  be  queried  with  this 
clade’s  rRNA  sequences. 

Three  marine  archaeal  targets  were  among  the  most  abundant  targeted 
taxa  in  the  MB  samples.  Furthermore,  of  15  total  archaeal  genotypes  targeted  by 
the  array,  7  were  present  in  at  least  one  MB  sample.  Typically,  marine 
euryarchaea  are  seen  in  low  numbers  in  the  water  column  while  marine 
crenarchaea  increase  with  depth  and  can  account  for  a  significant  proportion  of 
the  total  microbial  community  in  deeper  waters  (Massana  et  al.,  1997,  Karner  et 
al.,  2001,  Pernthaler  etal.,  2002).  In  Monterey  Bay,  pelagic  crenarchaeal  cells 
have  also  been  shown  to  increase  with  depth  and  to  represent  up  to  33%  of  the 
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200m  community,  while  euryarchaeal  cells  were  more  abundant  in  shallow 
samples  (up  to  12%  of  the  community  in  the  summer,  but  less  than  1% 
throughout  the  water  column  in  winter,  Pernthaler  et  al,  2002,  and  frequently 
below  FISH-based  detection  at  200m,  Mincer  et  al.,  2007).  These  trends  were 
generally  reflected  in  the  array  data.  Four  euryarchaeal  clones  were  present  in 
the  water  column.  One  (EB000_37F11)  was  in  the  frequent-shallow  c\a6e,  two  of 
which  were  in  the  abundant  deep-consistent  cluster  (DeepAnt_EC39  and 
EF10057A08),  and  the  last  of  which  (DeepAnt_JyKC7)  occurred  in  only  a  single 
200m  samples.  The  overall  frequency  of  these  archaeal  targets  suggests  a 
consistent  presence  of  these  taxa  at  this  location,  throughout  the  water  column. 
Two  crenarchaeal  targets  were  present  in  these  Monterey  Bay  samples,  and 
both  were  restricted  to  200m  samples.  One  (ANT74A4)  had  only  sporadic 
occurrence,  while  the  other  occurred  quite  frequently  (4B7,  in  13  of  23  200m 
samples).  Finally,  a  putatively-archaeal  target  of  unknown  identity 
(EB750_01  A01 )  was  sporadically  present  in  the  deep  samples.  The  presence  of 
euryarchaeal  clones  in  both  shallow  and  deep  samples,  and  the  restriction  of 
crenarchaeal  clones  to  the  deepest  samples,  reflected  the  general  trends  seen 
previously. 

It  is  notable  that  two  euryarchaeal  clones  were  among  the  most  abundant 
taxa  at  200m  across  all  sampling  dates,  given  the  clade’s  previously  documented 
maxima  near  the  surface  and  their  low  numbers  or  absence  in  some  studies  of 
deep  waters  at  this  location.  However,  previous  FISH-based  studies  used 
surface  rather  than  deep  euryarchaeal  phylotypes  to  generate  probes,  and  other 
studies  using  rRNA  clone  libraries  have  noted  appreciable  euryarchaeal 
abundances  in  deep  waters  (Lopez-Garcia  etai,  2001,  Massana  et  al.,  1997). 
This  observation  highlights  the  challenge  in  cross-comparing  techniques  with 
different  levels  of  phylogenetic  specificity.  While  previous  FISH-based 
investigations  targeted  broad  phylogenetic  groupings,  the  genome-proxy  array 
targeted  specific  genotypes.  One  of  the  array’s  two  deep-abundant  clones 
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originated  from  100m  in  Monterey  Bay,  while  the  other  came  from  500m  in  the 
Antarctic  Polar  Front  (Lopez-Garcia  et  al.,  2004).  The  array  results  can  only 
describe  the  taxa  targeted  and  cannot  be  generalized  to  the  clades  within  which 
those  taxa  occur,  and  thus  adding  additional  archaeal  targets  to  the  array  will  be 
important  in  expanding  the  breadth  of  archaeal  investigations  to  better  span 
known  marine  diversity. 

Previous  work  at  this  location  also  strongly  suggested  that  Crenarchaea 
play  a  significant  role  in  ammonia  oxidation,  and  mapped  their  initial  appearance 
in  the  water  column  to  the  nitracline  (Mincer  et  al.,  2007).  In  addition,  co¬ 
occurrence  of  Crenarchaea  and  putatively  nitrite-oxidizing  Nitrospina  species 
indicated  a  possible  metabolic  link  between  these  two  groups  at  this  location 
(Mincer  et  al.,  2007).  qPCR  analyses  of  the  relative  ratios  of  crenarchaeal  SSL) 
rRNA  and  amoA  genes  in  four  depth  profiles  showed  a  1 :1  correlation  throughout 
the  water  column,  an  increase  with  depth,  and  maxima  at  200m  in  three  of  the 
four  profiles.  In  addition  to  the  concordance  of  the  array-based  abundance  of  the 
two  crenarchaeal  targets,  the  array  included  a  Nitrospina  target,  and  their 
distributions  also  agreed  with  the  previous  observations.  The  Nitrospina  clone 
(EB080_L20F04)  was  apparent  sporadically  in  a  single  30m  sample  and  in  200m 
samples,  a  subset  of  those  in  which  the  two  crenarchaeal  targets  were  present  (5 
of  13  200m  samples),  and  at  lower  signal  intensities.  Previous  qPCR  surveys 
showed  Nitrospina  SSU  rRNA  to  parallel  the  distribution  of  crenarchaeal  SSU 
rRNA  but  with  lower  abundances  (Mincer  et  al.,  2007).  Interestingly,  the  array 
also  targeted  a  betaproteobacterial  ammonia-oxidizer  Nitrosomonas  clone 
(EB080_L12H07)  captured  from  80m  in  Monterey  Bay,  and  its  distribution  was 
similar  to  that  of  the  Nitrospina  clone.  qPCR  surveys  of  Nitrosomonas-Wke  amoA 
sequences  at  four  sampling  dates  produced  very  low  counts  throughout  the 
water  column,  reinforcing  the  sporadic  nature  of  this  taxon’s  presence  in 
Monterey  Bay. 

Returning  to  the  depth-specific  clustering  of  taxa  in  the  array  profiles. 
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additional  deep-consistent  genotypes  included  four  pelagic  relatives  of  deep-sea 
invertebrate  (e.g.  vesicomyid  clam)  symbionts.  Two  such  16S-containing  clones, 
a  4000m  HOT-derived  clone  (HF4000_23L14),  related  to  ZD0405  (a  pelagic  16S- 
clone  related  to  symbionts),  and  an  80m  MB-derived  clone  (EB080_L31E09), 
were  consistently  abundant  at  200m,  with  the  latter  being  the  most  abundant 
targeted  genotype  at  this  depth.  In  addition,  two  rubisco-containing  clones 
(EB750_10B11,  EB750_10A10)  without  phylogenetic  markers,  whose  BLAST 
homology  indicated  possible  relatedness  to  symbionts,  were  also  abundant  in  the 
deep.  Furthermore,  two  of  these  four  clones  were  important  in  driving  the 
differentiation  of  200m  samples  from  shallower  ones  (Figure  5).  It  is  not 
uncommon  for  such  pelagic  relatives  of  symbiont  species  to  be  found  in  marine 
16S  surveys  (e.g.  Suzuki  et  al.,  2001,  Lopez-Garcia  e/a/.,  2001,  Bano  and 
Hollibaugh,  2002,  Zubkov  eta!.,  2002,  Klepac-Ceraj,  2004,  thesis).  Given  the 
availability  of  large  genomic  fragments  from  these  symbiont-related  organisms, 
investigation  of  their  potential  lifestyle  is  possible.  In  this  way,  there  is  a  feedback 
between  array  data  and  other  methods,  as  distribution  information  seen  with  the 
array  can  extend  metagenomic  snapshots,  particularly  for  groups  of  emerging 
interest  like  the  symbiont-relatives  which  have  not  been  studied  in  depth  with 
more  focused  methods  such  as  qPCR.  In  addition,  array-based  evidence  for 
different  populations  can  motivate  the  exploration  of  particular  hypotheses  within 
metagenomic  data. 

Two  deltaproteobacterial  clones  (EB750_01B07,  HF4000_08N17  -  the  latter 
within  the  SAR324  clade)  were  also  within  the  deep-consistent  cluster,  consistent 
with  the  previous  depth  preference  described  for  this  group  (e.g.  Wright  et  al., 
1997).  These  targets  were  also  highly  correlated  to  the  differentiation  of  200m 
from  Om  and  30m  samples.  Finally,  the  remaining  deep-consistent  target  was  the 
gammaproteobacterial  clone  EB080_L93H08,  which  clustered  together  with 
deep-sea  environmental  clones  from  around  the  world,  notably  ZD0417  and 
DHB-2  (Lopez-Garcia  et  al.,  2001 ),  although  the  natural  history  of  this  clade 
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remains  a  mystery.  The  array  data  demonstrating  the  consistency  of  this  clade’s 
presence  in  200m  Monterey  waters,  combined  with  its  common  occurrence  in 
16S  rRNA-gene  clone  surveys  in  a  variety  of  locations,  suggest  it  warrants 
further  study.  As  this  clade  remains  uncultivated,  genome  fragments  provide  an 
important  window  into  its  potential  lifestyle,  and  array  profiling  of  its  abundance  at 
other  sites  over  time  could  help  define  its  habitat. 

Interestingly,  targeted  cyanobacteria  did  not  show  strong  or  consistent  array 
signal  in  Monterey  Bay.  However,  the  episodic  surface  appearance  of 
Prochlorococcus  MED4  helped  differentiate  Om  from  30m  samples  (Figure  5). 
Also,  the  use  of  a  1.6pm  pre-filter  during  sample  collection  likely  excluded  larger 
Synechococcus  cells. 

The  sole  Hawaii  depth  profile  showed  markedly  different  taxa  abundances 
than  Monterey  samples,  although  it  retained  strong  depth-based  clustering 
(Figure  4b).  When  clustered  together  with  Monterey  Bay  samples,  the  Hawaii 
500m  sample  was  more  like  200m  Monterey  samples,  although  it  was  basal  to 
that  cluster,  while  the  shallower  three  Hawaii  samples  formed  their  own  cluster 
separate  from  all  Monterey  samples  (Figure  4c).  No  taxa  in  the  500m  HOT179 
sample  appeared  in  the  shallower  three  samples,  and  vice  versa.  A  notable 
difference  in  shallow  Hawaii  taxa  compared  to  Monterey  taxa  were  the 
cyanobacteria,  with  a  general  lack  thereof  in  Monterey  Bay,  while 
Prochlorococcus  strains  9312  and  ASC9601  were  the  most  abundant  signals  at 
all  three  shallow  depths.  The  dominance  of  these  clades  was  consistent  with 
previous  metagenomic  work  at  this  location  (e.g.  Coleman  et  al.,  2006,  DeLong 
et  al.,  2006).  The  other  shallow  taxa  were  also  different  from  those  present  in 
Monterey,  with  the  majority  never  occurring  there  or  only  occurring  sporadically. 
Another  notable  difference  between  the  two  locations’  profiles  was  the 
appearance  of  more  discrete  zonation  in  the  Hawaii  data;  all  shallow  samples  did 
not  appear  similar.  However,  a  large  caveat  to  the  HOT179  profile  must  be 
offered  here.  The  triplicate  array  hybridizations  used  to  generate  these  data  were 
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dimmer  than  Monterey  Bay  hybridizations,  despite  using  the  same  amount  of 
starting  DNA.  Using  the  same  data-filtering  parameters  optimized  for  the 
Monterey  Bay  profiles,  in  order  to  most  robustly  allow  cross-comparison  and 
clustering,  very  few  taxa  were  “present”  in  the  Hawaii  profiles.  There  are  several 
possible  explanations  for  the  poor  quality  data  obtained  from  this  Hawaii  profile: 
(i)  estimates  of  DNA  concentrations  were  inaccurate,  and  less  DNA  was 
hybridized,  (ii)  the  quality  of  the  DNA  was  poor,  with  inhibitors  present  (though 
this  is  unlikely  as  it  was  spectrophotometrically  clean  and  had  been  thoroughly 
extracted  and  cleaned),  (iii)  data  processing  parameters  optimized  for  one 
location  cannot  be  transferred  to  another  site;  this  would  confound  cross-site 
comparisons,  and  is  not  indicated  by  previous  work  with  the  array,  or  (iv)  there 
was  something  else  substandard  in  the  HOT179  hybridization  or  scanning 
process  which  resulted  in  less  signal.  Based  on  previous  hybridizations  with 
assorted  Hawaii  samples,  I  believe  a  combination  of  (i)  and  (iv)  is  most  likely. 
The  data  obtained,  using  either  Monterey-tailored  filtering  parameters,  unfiltered 
data,  or  empirically  tuned  filtering  parameters,  are  consistent  with  taxa 
expectations  for  the  location.  Also,  previous  research  (Rich,  Konstantinidis  and 
DeLong,  2008)  showed  transferability  of  the  prototype  array  to  another  coastal 
location  in  a  different  ocean  basin  (Atlantic)  using  identical  processing 
parameters.  Thus,  it  seems  likely  that  the  array  will  be  able  to  be  used  across 
locations  without  retuning  the  data  processing  pipeline. 

In  addition  to  examining  clade-related  depth  distributions,  Monterey  Bay 
samples  were  further  investigated  for  variability.  Variability  among  samples  and 
its  causes  and  significance  is  a  major  consideration  when  dealing  with  natural 
environmental  samples.  As  indicated  by  branch  length  on  the  sample  clustering, 
there  was  much  more  variability  among  shallow  samples  than  deep  ones,  as 
would  be  expected  based  on  their  more  variable  oceanographic  conditions. 
Profile  variability  did  not  correlate  overall,  however,  to  Monterey  Bay’s  typical 
oceanographic  seasons  (Figure  4a:  spring/summer  upwelling,  fall  upwelling,  and 
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winter  non-upwelling,  as  defined  in  e.g.  Pennington  and  Chavez,  2000, 
Pennington  et  al.,  2007).  There  is  substantial  yearly  oceanographic  variability  at 
this  location  in  the  timing  of  upwelling  events,  though,  and  phytoplankton 
abundance  and  growth  rates  can  be  “strikingly  pulsed”  (Pennington  and  Chavez, 
2000).  The  dynamics  of  the  sampled  periods  did  not  fit  the  time-averaged 
seasonal  delineations,  so  it  may  not  be  surprising  that  there  was  little  apparent 
correlation  between  sample  profiles  and  the  site’s  typical  oceanographic 
seasons.  Profiling  of  additional  years  might  reveal  a  stronger  cumulative  signal 
among  seasons.  Alternately,  a  more  focused  taxa-by-taxa  correlation  analysis 
could  reveal  correlations  to  oceanographic  season  that  are  not  evidenced  in 
community-wide  profiles  but  are  present  in  some  subset  of  taxa  present. 

Sample  variability  was  reflected  not  only  in  cluster  branch  length,  however, 
but  also  in  the  relative  intensities  in  each  sample’s  profile,  with  much  greater 
heterogeneity  in  intensity  among  shallow  profiles  than  deep  ones.  In  particular, 
several  shallow  sampling  dates  were  notably  intense  (red  starred  samples  in 
Figure  4a).  The  date  with  the  highest  intensities  is  April  25th,  2001,  which  occurs 
just  after  the  largest  upwelling  event  in  the  first  19-mos  sampling  period  (as 
indicated  by  nitrate  concentrations:  sampling  date  481  in  Figure  10).  Other 
particularly  intense  samples  include  0ct3_2000,  0ct25_2000,  May15_2001, 
Oct21_2003,  and  Mar31_2004.  These  samples  were  all  collected  after  upwelling 
events,  during  upwelling  seasons  (Figure  10;  red  arrows  and  black  dashed 
vertical  lines). 

Previous  studies  have  shown  that  different  phytoplankton  dominate  the 
spring/summer  versus  fall  upwellings  (Pennington  et  al.,  2007),  which  might 
suggest  that  different  bacteria  would  also  be  apparent  after  spring  versus  fall 
upwelling  events,  even  if  there  were  not  strong  community  differences  between 
the  annually-averaged  seasons  overall.  However,  not  only  do  the  intense  post- 
upwelling  profiles  not  all  cluster  monophyletically,  indicating  that  their  profiles  are 
not  consistently  most  similar  to  one  another,  but  fall  and  spring  upwelling  profiles 
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do  not  each  cluster  together  either.  This  suggested  that  despite  phytoplankton 
differences  among  upwelling  seasons,  those  taxa  targeted  by  the  array  do  not 
follow  the  same  trend,  at  least  within  the  inter-annual  variability  encompassed  by 
these  samples. 

Thus,  at  this  study’s  sampling  frequency,  there  did  appear  to  be  a  post- 
upwelling  signature  in  these  data,  but  at  the  scale  of  individual  events  rather  than 
across  seasons,  and  in  the  form  of  increased  signal  from  pre-existing,  common, 
abundant  taxa  rather  than  unique  ones.  The  strongest  signals  came  from  a  group 
of  NAC1 1-7  targets  (EB080_L1 1 F12,  EB080_L43F08,  EB080_  L27A02,  and 
HTCC2255),  and  two  PR-containing  alphaproteobacterial  clones  lacking 
phylomarkers  (EB000_39F01,  EB000_55B11).  As  described  above,  these  six  are 
all  within  the  shallow-consistent  or  frequent  cluster  of  targets.  The  NAC1 1-7 
roseobacterial  clade  is  often  associated  with  bloom  and  post-bloom  conditions 
(as  reviewed  in  Buchan  et  a!.,  2005),  ostensibly  due  to  the  common 
roseobacterial  ability  to  degrade  dimethylsulfoniopropionate,  an  osmolyte 
produced  by  a  variety  of  phytoplankton.  Thus,  the  prominent  role  of  NAC1 1-7 
targets  in  the  array  data  from  this  coastal  upwelling  site,  and  their  particular 
intensity  after  bloom  conditions,  is  consistent  with  previous  observations  of  this 
clade. 

It  may  be  surprising,  however,  that  PR-containing  targets  (the  two  without 
phylogenetic  markers,  and  the  NAC1 1-7  HTCC2255  genome)  would  be  among 
those  with  the  strongest  post-bloom  responses.  The  diversity  of  lineages 
containing  proteorhodopsin  genes,  and  their  abundance  in  a  variety  of  photic 
marine  habitats  implies  a  probable  diversity  in  PR  lifestyle  use.  The  role  of  the 
PR  gene  in  the  ubiquitous  SAR1 1  clade  has  remained  unclear  but  has  been 
hypothesized  to  allow  survival  during  lean  oligotrophic  conditions  (Giovannoni  et 
al.,  2005,  Schwalbach  eta!.,  2005).  Alternately,  the  PR-containing  Bacteroidetes 
cultivar  Dokdonia  sp.  MED134  showed  increased  growth  in  light  versus  dark 
conditions  in  a  laboratory  culture  (Gomez-Consarnau  et  al.,  2007).  Many 
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Bacteroidetes,  and  Flavobacteria  in  particular  (of  which  MED134  is  one),  are 
abundant  during  and  after  phytoplankton  blooms,  and  it  was  hypothesized  that  in 
end-bloom  conditions  of  decreasing  organic  matter,  PR  might  allow  MED134  to 
persist  as  other  heterotrophs  declined  (Gomez-Consarnau  et  al.,  2007).  An 
additional  cellular  lifestyle  that  may  be  linked,  in  some  lineages,  to  the  PR  gene, 
is  a  cyclic  lifestyle  alternating  between  attached  and  free-living  stages.  In  this 
case,  PR  could  provide  energy  to  help  cross  the  “deserts"  between  particles 
(postulated  for  the  Flavobacterial  cultivar  Polaribacter  sp.  MED152  in  Gonzalez 
et  al.,  2008).  Thus,  the  array-based  abundance  of  PR-containing  targets  during 
bloom  and  post-bloom  conditions  could  have  several  possible  explanations.  First, 
it  might  simply  reflect  that  these  taxa  were  highly  competitive  heterotrophs  under 
bloom  conditions,  with  PR  genes  being  incidental  to  the  bloom-related  phase  of 
their  lifestyle.  Second,  like  the  hypothesized  role  in  the  MED134  cultivar,  PR 
might  have  allowed  these  taxa  to  persist  longer  than  other  heterotrophs  as  the 
bloom  waned.  Lastly,  the  PR  might  have  played  a  more  an  active  role  in  bloom 
utilization,  helping  provide  the  energy  for  organic  matter  uptake  and/or 
degradation,  and  allowing  these  heterotrophs  to  compete  more  effectively  for 
bloom  carbon.  From  the  current  information,  we  cannot  assess  the  relative 
likelihood  of  each  scenario.  However,  additional  oceanographic  data  from  these 
and  adjacent  sampling  dates  could  help  identify  bloom  stage.  Also,  three  of  the 
intense  array  profiles  have  associated  pyrosequence  data.  It  could  be  used  to 
quantify  actual  numerical  dominance  of  the  PR-containing  clones  more  directly 
rather  than  inferred  from  array  intensity,  and  compared  to  the  other  heterotrophs 
present. 

In  addition  to  examining  sample  variability  through  the  lenses  of 
oceanographic  season  and  of  upwelling  events  and  associated  blooms,  we 
looked  more  precisely  at  the  environmental  variability  through  actual  nutrient 
concentrations  in  each  sample,  and  their  correlations  to  the  major  variability  in 
the  data,  to  both  canonical  discriminant  (c.d.)  and  principal  component  (p.c.) 
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axes.  With  the  variability  among  the  three  depths'  array  profiles  maximized  in  a 
single  CDA,  the  strong  correlations  of  each  axis  to  nutrient  concentrations 
(Figure  6b)  simply  recapitulated  the  oceanographic  differences  in  nutrient 
concentrations  with  depth  at  this  location.  The  higher  concentrations  of  silicate, 
phosphate  and  nitrate  in  the  deeper  samples  (seen  in  the  oceanographic  data 
plotted  in  Figure  10)  were  reflected  in  those  nutrients’  correlations  to  the  first  c.d. 
In  addition,  the  correlation  of  nitrite  to  the  second  c.d.  indicated  that  30m  was  a 
chemically,  not  just  photically,  distinct  environment  from  Om  (also  see  Figure  10). 
A  water  column  nitrite  maximum  is  commonly  seen  below  the  mixed  layer  due  to 
active  denitrification  of  organic  nitrogen  entering  from  above,  and  at  Station  Ml, 
30m  represents  the  base  of  the  mix  layer  through  much  of  the  year  (Figure  S8). 

Based  on  the  markedly  different  chemical  and  photic  environments  of  the 
Om  and  30m  samples,  it  is  surprising  that  there  were  not  larger  differences  in  the 
Om  and  30m  array  profiles.  However,  mixed  layer  depth  is  quite  dynamic  at  this 
site,  as  seen  both  by  the  calculated  MLD  across  sampling  dates  (Figure  S8)  and 
in  temperature-vs. -depth  profiles  for  each  sampling  date  (not  shown),  which 
usually  show  a  gradual  decrease  of  temperature  with  depth  rather  than  a  discrete 
thermocline.  Thus,  because  of  water  column  mixing,  these  two  communities  may 
have  been  frequently  homogenized.  In  addition,  even  without  mixing,  we  would 
have  expected  the  30m  communities  to  include  a  subset  of  Om  communities, 
particularly  for  larger-celled  taxa,  due  to  particle  sinking.  Although  the  Om  and 
30m  array  profiles  did  not  cluster  together,  some  subtle  differences  were 
revealed  by  the  correlation  of  taxa  abundances  to  CDA  axes  (Figure  6a),  which 
showed  a  small  number  of  taxa  (EBAC_27G05,  EB000_65A11,  and 
EB000_39H12)  were  differentially  common  and  abundant. 

Each  of  the  three  sampled  depths,  when  investigated  separately,  showed 
distinct  relationships  between  nutrient  variability  and  array  profile  variability  in 
single-depth  PCA  correlations  to  nutrients  (Figure  7).  At  Om,  there  was  no 
appreciable  correlation  between  nutrient  concentration  and  sample  variability 
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(Figure  7a).  This  is  somewhat  surprising  given  the  post-upwelling  intensity 
signature  in  the  communities.  However,  the  uptake  of  upwelled  inorganic 
nutrients  is  rapid,  and  subsequent  organic  forms  of  these  nutrients  were  not 
measured.  In  addition,  the  strong  wind-based  homogenization  of  the  mixed  layer 
might  obscure  relationships  between  patchy  surface  nutrients  and  community 
profiles.  By  30m,  however,  array  profile  variability  was  related  to  nutrient 
concentrations.  The  nutrient  signatures  of  upwelling  events  (nitrate,  phosphate 
and  silicate)  were  correlated  to  sample  variability,  as  were  the  episodic  nitrite 
maxima  caused  by  remineralization,  with  an  opposite  vector  direction,  as 
expected  (Figure  7b).  At  200m,  the  picture  was  more  complex.  Although  200m  is 
a  more  stable  and  homogenous  chemical  environment  than  shallower  depths, 
there  remained  considerable  intra-annual  variability  in  nutrient  concentrations 
(Figure  10),  reflecting  deeper  upwelling,  advection,  etc.  In  this  case,  however,  the 
upwelling-characteristic  nutrients  appeared  decoupled;  the  correlation  vectors  for 
silicate  and  phosphate  were  offset  but  congruent,  while  nitrate  showed  no 
correlation  to  array  profile  variability  (Figure  7c).  In  addition,  nitrite  produced  a 
correlation  vector  smaller  and  roughly  perpendicular  to  those  of  phosphate  and 
silicate.  The  200m  samples  most  influenced  by  the  higher  silicate  and  phosphate 
(2001_Apr_25,  2002_Apr_1 1, 2004_Jan_21, 2004_Mar_10,  2004_Mar_31, 
2004_May_3)  are  near  the  spring  bloom  timing  for  each  of  the  sampled  years 
(2003  was  not  sampled  in  the  spring),  although  for  2002  the  oceanographic  data 
do  not  indicate  a  preceding  upwelling  event.  These  dates  include  two  which  also 
showed  highest  Om  array  profile  intensities.  Focusing  specifically  on  the  2004 
samples,  a  decoupling  of  silicate  and  phosphate  was  apparent  in  the 
oceanographic  data  (Figure  10)  as  well.  For  example,  on  May  3'^'^,  phosphate 
concentration  was  high,  silicate  was  high,  and  yet  there  was  a  dramatic  drop  in 
nitrate  levels,  compared  to  the  surrounding  time  periods  and  occurring 
throughout  the  water  column. 

Diatoms  dominate  the  spring  upwelling  at  this  location  (Pennington  et  al., 
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2007).  I  hypothesize  that  the  temporal  pattern  in  nitrate,  phosphate  and  silicate 
concentrations  at  200m,  particularly  evident  in  dramatic  upwelling  series  in  spring 
2004,  and  the  strong  correlation  of  array  profile  variability  to  silicate  and 
phosphate  and  decoupling  from  nitrate,  represent  post-diatom-bloom 
remineralization  signatures.  The  sequence  of  events  begins  as  cold  nutrient-rich 
water  upwells  through  the  water  column;  this  is  seen  most  clearly  in  early  spring 
of  2004.  As  diatoms  bloom  and  begin  to  settle  through  the  water  column,  they 
are  remineralized  and  may,  depending  on  flux  rates,  produce  a  short-lived 
phosphate  increase,  as  in  mid-spring  2004.  Depending  on  the  volume  of  settling 
material,  organic  matter  degradation  may  strip  that  water  of  some  nutrients, 
which  may  explain  the  sharp  drop  in  nitrate  throughout  the  water  column  so  soon 
after  its  upwelling-associated  spike,  concurrent  with  the  high  levels  of  phosphate; 
remineralized  nitrogen  in  the  initial  form  of  ammonia  is  consumed  before  it  can 
be  converted  to  nitrate,  and  existing  nitrate  is  also  taken  up  by  the  actively 
degrading  community.  (Low  nitrate  levels  are  not  explained  by  rapid  nitrification, 
since  the  relatively  small  spike  of  nitrite  occurs  later  and  is  of  insufficient 
magnitude).  Finally,  as  the  more  recalcitrant  frustule-associated  component  of 
the  sinking  diatomaceous  organic  matter  becomes  a  higher  percentage  of  the 
total  available  organic  matter,  silicate  concentrations  increase  as  silicate  is 
remineralized.  Additional  oceanographic  data  may  shed  light  on  the  likelihood  of 
this  post-bloom  remlneralizatlon  hypothesis  as  an  explanation  for  the  observed 
200m  correlations. 

Variability  among  samples  can  be  considered  not  only  in  the  local  context, 
with  Monterey  Bay  as  a  particularly  dynamic  environment,  but  also  in  the  context 
of  the  marine  environment  more  broadly.  Ocean  surveys  can  be  affected  by 
strong  spatial  and  temporal  heterogeneity,  as  strongly  evidenced  in  several 
studies  of  chemical,  physical  and/or  biological  variability.  In  one  study,  the 
Cytosub,  an  un-manned  autonomous  underwater  vehicle  with  an  inline  flow 
cytometer,  was  tethered  for  30  days  inside  a  semi-enclosed  harbor  within  the 
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Bay  of  Marseilles  and  collected  data  on  phytoplankton  abundances  every  30 
minutes  (Thyssen  et  at.,  2008).  After  accounting  for  diel  variations  and 
measurement  error,  25%  of  successive  samples  had  >=32%  unexplained 
variability.  It  was  speculated  that  this  rapid  variability  at  a  single  sampling  point 
might  have  been  due  to  be  genuine  biological  patchiness,  physical  forcings  (e.g. 
winds,  tides),  community  dynamics  (e.g.  grazing,  lysis),  or  behavior  (e.g. 
migration)  (Thyssen  et  al.,  2008).  Although  the  sample  proximity  to  shore  likely 
exacerbated  variability  from  episodic  terrestrial  inputs  of  nutrients,  etc.,  this  study 
demonstrated  that  temporal  variability  in  ocean  habitats,  particularly  coastal 
ones,  is  poorly  understood. 

In  addition  to  temporal  variability,  spatial  variability  may  significantly  impact 
observations.  Unlike  a  terrestrial  environment,  a  single  sampling  point  in  a 
marine  habitat  may  represent  very  different  water  masses  as  currents  shift.  In 
addition,  spatial  patchiness  cannot  be  explained  solely  by  physical  forcing  (Martin 
et  al.,  2005).  Another  high-resolution  flow-cytometry-based  study  investigated 
spatial  heterogeneity  of  Synechococcus  and  heterotrophs  over  a  120-km 
diameter  region  of  the  Celtic  Sea  (Martin  et  al.,  2005).  Repeated  triangular 
transects  indicated  that  the  variability  between  sampled  communities  12km  apart 
could  equal  the  variability  seen  over  seasonal  cycles  in  this  area.  Furthermore, 
correlations  to  variability  in  physical  factors  (temperature,  salinity  and  density) 
could  account  for  at  most  44%  of  the  observed  variability.  Nor  could  the 
fluctuations  be  due  to  population  doubling,  or  to  mixing  from  below.  The  authors 
suggested  that  all  time-series  studies  be  accompanied  by  in-depth  spatial 
surveys  of  the  region  as  well,  periodically  through  the  sampling  duration,  to  better 
constrain  the  percent  of  observed  variability  that  could  be  apportioned  to 
temporal  dynamics  versus  what  is  just  patchiness. 

In  this  vein.  Station  Ml  is  in  mid-Monterey  Bay  and  is  significantly  affected 
by  the  seasonal  Davenport  Upwelling  Plume  which  leaves  the  coast  at  Santa 
Cruz  and  flows  southward  through  the  middle  of  the  Bay  (Pennington  and 
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Chavez,  2000).  Conditions  can  shift  this  plume,  and  biological  oceanographic 
parameters  can  be  dramatically  different  from  its  edge  to  its  middle  (Pennington 
and  Chavez,  2000).  Therefore  one  might  expect  additional  variability  among 
samples  from  this  site,  due  to  movement  of  the  plume,  which  is  a  major  driver  of 
site  biology. 

Lastly,  in  addition  to  examining  particular  taxa  and  their  potential 
correlations  to  environmental  parameters,  the  genome  proxy  array  has  the  ability 
to  indicate  the  presence  of  non-target  strains  and  to  reveal  population  shifts  over 
time.  This  process,  demonstrated  for  the  SAR86  target  EB000_45B06  in  the 
Monterey  Bay  data  in  Figure  9,  allows  one  to  tunnel  in  from  the  overall  array- 
based  target  probeset  intensity,  to  the  likely  genetic  relatedness  of  the 
hybridizing  strain  to  that  targeted,  to  the  similarity  in  the  pattern  of  hybridization 
across  the  target  probeset  among  different  samples.  For  EB000_45B06,  the 
second  stage  of  this  analysis  -  that  is,  looking  at  the  Tukey  biweight  signal 
across  the  probeset  -  suggested  that  hybridized  DMAs  all  had  fairly  similar 
identities  to  the  targeted  strain.  However,  the  finer-scale  level  of  analysis 
suggested  the  presence  of  four  different  hybridization  patterns  in  the  39  samples 
in  which  this  target  was  present,  based  on  the  clustering  of  pair-wise  Pearson 
correlations  of  the  pattern  among  samples  (Figure  9).  The  ecological  relevance 
of  these  potential  populations  was  suggested  by  the  sample  origins  of  each 
cluster;  rather  than  each  cluster  being  from  a  mix  of  depths,  as  expected  if  these 
differing  non-target  but  related  DMAs  had  similar  ecology,  three  of  the  four 
clusters  had  cohesive  occurrence  patterns.  Two  clusters  arose  from  Om  samples, 
each  with  one  aberrant  30m  sample,  and  one  cluster  arose  from  30m  samples, 
with  one  aberrant  Om  sample. 

The  array-based  conclusion  of  EB000_45B06  population  heterogeneity 
could  be  cross-checked  using  the  pyrosequence  data.  However,  two  of  the  three 
pyrosequenced  samples  fall  into  the  same  pattern-cluster,  and  the  third  is  in  an 
adjacent  cluster  but  has  a  high  correlation  to  the  first  two.  Ideally,  a  target 
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present  in  all  three  pyrosequenced  samples  but  with  quite  distinct  patterns  in  at 
least  two  would  be  cross-validated.  BLASTing  the  target  sequence  against  each 
of  the  three  databases  and  plotting  the  results  as  recruitment  plots,  showing 
relative  identity  to  the  target,  might  not  reveal  the  differences  seen  with  the  array, 
since,  for  example,  in  the  case  of  EB000_45B06,  the  TBW  analysis  already 
suggests  all  four  hypothetical  populations  have  similar  identity  to  the  target. 
Therefore,  the  identity  of  recruited  fragments  to  one  another  would  ideally  be 
compared  as  well,  and  this  can  only  be  achieved  with  overlapping  recruits,  which 
requires  that  the  taxa  be  abundant.  To  lay  the  groundwork  for  this  further 
analysis,  the  probe-set  hybridization  pattern  correlations  were  clustered  for  every 
target  occurring  in  all  three  pyrosequenced  samples  (Figure  S9).  The  differences 
in  these  diagrams  highlight  the  differing  levels  of  population  homogeneity  among 
lineages  at  this  site.  Based  on  the  dual  requirements  of  showing  high  mean 
signal  intensity  in  all  three  samples,  and  having  dissimilar  patterns  of 
hybridization,  the  best  candidates  for  BLAST-based  investigation  of  the 
pyrosequence  data  are  EB000_55B11  and  EB000_39F01,  both  of  which  carry 
the  PR  genome  but  lack  phylogenetic  markers.  Future  work  will  examine  the 
population  structure  of  these  two  clones  in  the  three  pyrosequence  datasets,  if 
either  provides  sufficient  coverage  to  do  so. 

In  conclusion,  exploration  of  the  array  profiles  and  the  underlying  causes  of 
their  variability  allows  a  more  refined  understanding  of  target  natural  history,  and 
of  community  dynamics  over  time,  relative  to  most  other  methods  available.  Thus 
far,  we  tracked  the  genotype  abundances  of  268  marine  target  taxa  through  57 
samples  collected  across  four  years  in  Monterey  Bay,  at  three 
oceanographically-distinct  depths.  95  taxa  were  present  in  at  least  one  sample, 
and  most  taxa  showed  differential  distribution  with  depth.  Highly  abundant 
shallow  taxa  included  representatives  of  the  SAR86,  SAR1 16,  SAR1 1 ,  and 
Roseobacter  clades.  Notably,  the  majority  of  abundant  shallow  taxa  contained 
the  proteorhodopsin  gene.  Highly  abundant  deep  taxa  included  representatives 
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of  marine  pelagic  euryarchaea,  deltaproteobacteria  (including  the  SAR324 
clade),  and  relatives  of  invertebrate  symbionts.  All  200m  samples  clustered 
together  to  the  exclusion  of  Om  and  30m  samples,  although  there  was  no  clear 
clustering  of  each  of  the  shallower  depths.  No  clustering-based  correlation  of 
sample  profile  to  oceanographic  season  was  seen,  but  overall  profile  intensity 
“blooms”  were  observed  in  profiles  after  episodic  upwelling  events,  and  possible 
post-bloom  remineralization  events  were  indicated  in  several  200m  samples.  In 
addition,  the  single  depth  profile  from  Hawaii  also  showed  depth-specific  taxa 
distributions,  whose  composition  was  markedly  different  than  Monterey  Bay 
samples,  although  the  500m  sample  clustered  basal  to  the  MB  200m  samples. 

A  unique  potential  contribution  of  this  array  platform  is  the  ability  to 
delineate  different  populations  of  closely-related  cells,  and  their  dynamics  over 
time.  A  key  next  step  will  be  validating  this  array-based  population  mapping 
through  the  three  Om  pyrosequence  datasets.  In  addition,  further  correlations  of 
environmental  data  for  these  samples,  and  temporal  autocorrelation  analysis,  will 
help  clarify  temporal  patterns  in  the  array  profiles,  and  define  the  strength  of 
annual  community  cyclicity. 

Time-series  ecology  in  marine  microbial  systems  is  vital  to  expanding  our 
knowledge  of  marine  microbes  from  snapshots  of  taxa,  gene  contents,  and 
biogeochemical  potentials,  into  a  more  realistic  view  of  the  dynamic  nature  of 
these  communities,  likely  variable  on  the  scale  of  hours  and  milliliters.  Until  it  is 
practicable  to  sequence  large  numbers  of  environmental  samples  for  time-series 
studies,  tools  that  can  inexpensively  and  precisely  track  native  taxa,  at  levels  of 
phylogenetic  discrimination  relevant  to  their  ecology,  remain  an  important  goal  of 
microbial  ecological  methodology.  Furthermore,  sifting  of  vast  metagenomics 
datasets  without  a  priori  hypotheses  remains  challenging  and  unwieldy,  and 
complementary  tools  are  required  to  help  direct  such  investigations.  Monterey 
Bay  is  one  of  the  best-studied  sites  in  the  global  ocean,  and  this  genome-proxy 
array-based  investigation  of  its  dynamics  brings  new  nuance  to  the  picture  of  its 
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microbial  communities.  In  addition,  the  array-based  evidence  for  multiple 
populations,  potentially  with  distinct  ecological  niches,  poses  specific  questions 
for  exploration  in  metagenomic  datasets  from  this  and  other  locations. 
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Figure  1.  Radial  tree  illustrating  the  phylogenetic  relationships  among  the  268 
targets  of  the  expanded  genome  proxy  array.  Numbers  indicate  the  number  of 
targets  within  each  phylogenetic  clade.  Sequences  from  clones  lacking  a  small 
subunit  rRNA  gene  (SSU)  phylomarker  are  represented  separately  by  the  hexagon. 
Tree  was  created  based  on  alignment  of  16S  rRNA  sequences  using  ARB. 
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Figure  2.  Cross-comparison  of  array-  and  pyrosequence-based  target  abundances 
for  three  MB  samples.  Using  BLASTN  parameters  optimized  to  mimic  array  cross- 
hybridization,  all  268  targeted  genomes  and  genome  fragments  were  BLASTed 
against  the  pyroseqeunce  database  for  each  sample.  Pyrosequences  were  assigned 
to  one  or  more  array  targets,  proportional  to  the  bitscore  of  each  match.  The  number 
of  pyrosequences  matching  each  target  was  normalized  to  target  length  and 
database  size,  and  compared  to  the  unfiltered  array  signal  (see  Methods  and 
Results)  of  the  same  clone.  Correlation  lines  were  not  forced  through  the  origin. 
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Figure  3.  Monterey  Bay  sample  origin,  (a)  Samples  were  collected  in  Monterey  Bay, 
California,  at  Station  M1  (red  circle  on  satellite  image  from  XXXX).  (b)  Samples  from 
three  depths  over  4  years,  hyrbidized  to  the  array  in  this  study,  are  shown  in  relation 
to  the  site’s  temperature  vs.  depth  profile  for  the  same  period.  Samples  (black 
diamonds)  were  collected  during  two  consecutive  sampling  periods  (horizontal  solid  red 
bars),  separated  by  a  sampling  hiatus  (horizontal  blue  bar).  X-axis  numbers  represent 
sampling  duration  from  January  1st,  2000,  with  years  indicated  below  and  delineated 
by  dashed  vertical  red  lines.  Months  are  indicated  by  their  first-letter  designations.  42 
samples  from  the  first  19-mos  period  at  Om,  30m,  and  200m,  and  15  samples  from  Om 
and  200m  over  the  final  ^9-mos  of  sampling,  were  hybridized  to  the  array. 
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Figure  4.  Clustering  of  hybridizations  by  sample  and  by  genotype.  Hierarchical 
clustering  was  performed  in  GenePattern  using  Pearson  correlation  (see  Methods) 
and  is  shown  across  the  top  for  samples  and  along  the  side  for  genotypes. 
Genotypes  are  color-coded  by  phylogenetic  identity  and/or  gene  content  of 
particular  interest  (see  color  legend).  Intensity  of  yellow-to-red  color  for  each 
genotype  and  sample  date  Indicates  relative  mean  organismal  signal,  (a)  Monterey 
Bay  samples.  Samples  are  named  Depth_Year_CollectionDate,  and  are  color- 
coded  by  depth  and  by  oceanographic  season  (see  color  legend  and  text).  The 
break  between  shallow  and  deep  samples  is  additionally  indicated  by  the  blue 
vertical  dashed  line.  Clusters  of  targets  referred  to  in  the  text,  “shallow-consistent", 
“shallow-frequent",  and  “deep-consistent",  are  boxed  with  dashed  red  lines.  Red 
asterisks  denote  samples  with  particularly  intense  Om  profiles;  the  30m  and  200m 
samples  for  the  same  dates,  where  available,  are  indicated  by  blue  asterisks,  (b) 
Hawaii  samples  from  cruise  HOT179,  named  by  depth  of  sample,  (c)  Hawaii 
samples  and  MB  samples  clustered  together.  Hawaii  samples  denoted  by  red 
dashed  box;  the  three  shallow  HOT179  samples  cluster  separately  from  all  MB 
samples,  while  the  500m  HOT179  clusters  basally  to  the  200m  MB  samples. 
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Genotypes  Present 


b) 


Samples 


Target  Genotype  Legend 

Cyanobacteria 

PR-containing 

SAR11 

Archaea 

possible  S-oxidizing  link 
16S-  or  23S-  containing 
MB-derived  clones 
HOT-derived  clones 
genomes 


Array  Signal 


Less  More 


EB750  10B11 

EBAC750  11E01 

EB750_03B02 

0eepAnt_EC39 

FR750  01B07 

EB750_02H09 

HF4000  08N17 

HF4000  37C10 

FB0O0_36F02 

HF4000_384_009C18 

FB750  09G06 

FB75010A10 

HF0770_09NI20 

HF4000_23L14 

FB080  69007 

HF0070_t5B21 

HOT2C01 

HF10  45G01 

HF0130  04F21 

Pro_NATL2A 

HF0070  08D07 

EBACreri  2aE09 

HF0010_05E14 

HF0010_10C01 

HF0070  14F07 

HOT4F07 

Pro.MIT  9515 

HF0070  30B07 

HF0l3O_05G0g 

HF10_12C08 

HF0010_11K06 

ProMED4 

Pro_9312 

Pro_AS9601 


unk  GammaproTeobacteria  related  to  S-0)cidi7ing  symbionts 
Aiphaproteobactena  ;  SAHii  ■  subgroup  2 
Deltaproteobacteria  :  SAR406 

Archaea  Furyarchaeota  G2  ,  came  from  500m  Antarctic  Polar  From 

unk  Oeitaproteobacleria 

Gammaproteobactena  :  SARI 56 

Deltaproteobacterie  .  SAR324  cluster ,  ctg_NfSA008  clade 

Alphaprcteobacteria  ;  SAR 1 1 

Alphaproteohacteria  ;  Rhodobacterales  CHABl-5 

Alphaproteobactaria  ;  Hickattsiales  ;  SAHII  cluster 

Alph^roteobecteria  .  looks  like  SAR1 1 

unk  Gammaproteobacteria 

Deltaprotaobactaria  ;  SAR 3 24 

Gammaproteobacteria  ,  Thiotricales  .  ZD0405  clede 

unk  Alphaprotcobactona 

Deltaproteobacteria  ;  SAR324 

unk  Aiphaproteobactena 

unk  Proteobacteria 

Bctaprotcobactcria  ;OM156 

Cyanobactarle ;  Prochloreles  ;  Prochlorococcus  LL  clade  ;  high  B/A  clede  I 
Gammaproteobacteria  ,  K189A  clede 
Gammaproteobacteria  ;  SAR8G-t 
Gammaproteobacteria  ;  SAR 92 

Alphaprotaobactarla  ,  Rhodobactaralas  ;  SAR  102  .  SAR  102 
Alphaproteohacteria  ;  OM75 
Gammaproteobacteria  ;  SAR86-I 

Cyarobectaria  ;  Prochloralas  ,  Prochlorococcus  HL  clede  .  low  B/A  clada  I 
Daltaproteobactaria  .  SAR324  duster .  SAR276  clede 
Deltaproteobacteria  .  SAR324 
unk  Alphaprotaobactaria 

Gammaproteobacteria  ;  SAR86  cluster .  SAR 89  clede 
Cyanobacteria  ;  Prochlorates  ;  Prochlorococcus  HL  clade  :  low  B/A  clade  I 

Cyanobacteria  ;  Prochlorales  ;  Prochlorococcus  HL  clade  ;  low  B/A  clade  II 

Cyarobacterie :  Prochloreles  :  Prochlorococcus  HL  clede  ;  low  B/A  clede  II 
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Sample  Legend 
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•  Om 

•  30m 
m  200m 


Figure  5.  Principal  component  analysis  of  Monterey  Bay  samples  by  array  hybrization 
data.  Deep  samples  are  separated  from  shallow  samples,  and  the  variability  of  30m 
samples  is  mostly  encompassed  within  the  Om  sample  varaibility. 
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EB000..39FC1  , 
ProMED4  . 
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)8j^27> 

‘  t 


EBAC  ^^39H12' 


*  EBOOD.65AM 


36^2C246 
Car«  (92  1M 


b) 


'3 

o 


X^OOrr 


S104  uM 

N<5ie  ^ 


(Si 


.  4^  U02  uM 
X3Qn 


•-0 


8  6-4  2  0  2  46 
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Figure  6,  Canonical  discriminant  (c.d.)  analysis  of  Monterey  Bay  sample  (Om  , 
30m  ,  and  200m  )  array  data,  with  parameter  correlations  to  c  d.  axes 

indicated  by  vector  length  and  direction,  (a)  Genotype  abundance  correlations  to 
c.d  axes;  the  distribution  of  particular  taxa  drive  the  differentiation  of  depths,  (b) 
Nutrient  correlations  to  c.d.  axes;  nutrients  are  dramatically  different  between  the 
three  depths,  and  this  strong  difference  is  recapitulated  in  the  correlations  to  c.d. 
axes. 
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Figure  7.  Principal  component  (p.c.)  analyses  of  Monterey  Bay  samples  at  each 
depth,  with  nutrient  (nitrate,  nitrite,  phosphate  and  silicate)  correlations  to  p.c.  axes 
indicated  by  vector  length  and  direction.  Each  sample  is  desigated  by  its  month  and 
year,  (a)  Om  samples;  the  sample  varaibility  among  Om  samples  is  not  strongly 
correlated  to  differing  nutrient  conencentrations.  (b)  30m  samples;  there  is  a  strong 
correlation  to  all  four  nutrients,  reflecting  the  strong  upwelling  signature  at  the  base  of 
the  mixed  layer,  (c)  200m  samples;  nitrite,  phosphate  and  silicate  each  correlate  to 
sample  variability,  in  distinct  ways. 
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Figure  8  Evaluating  the  genetic  relatedness  of  community  DNA  hybridized  to  the 
array.  On  the  left  are  mean  organism  signals  as  shown  in  Figure  4a,  repeated  here 
for  side-by-side  examination.  On  the  right  are  the  relative  ratios  of  the  Tukey 
Biweights  (TBW)  to  the  means  for  each  organism  (samples  in  same  order  as 
clustering  based  on  mean  signals,  on  left).  This  ratio  is  related  to  the  Identity  of 
hybridized  DNA  to  the  target  sequence.  Hybridized  DNAs  with  a  large  relative  drop  in 
signal  when  assessed  as  TBW  rather  then  as  mean  (darker  blue)  have  a  less  even 
signal  across  their  target  probesets,  and  are  thus  Inferred  to  be  less  closely  related  to 
the  target  sequence  (l  e.,  80-90%  ANI),  whereas  hybridized  DNA  with  higher 
TBW:Mean  ratios  (lighter  blue)  are  inferred  to  be  genotypes  more  closely  related  to 
targeted  sequences  (I.e.  >90%  ANI),  as  in  Rich,  Konstantinidls  and  Delong  (2008). 
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Figure  9.  Revealing  population  heterogeneity  by  the  genome  proxy  array:  tunneling  in 
on  strain  and  population  information,  (a)  Mean  target  intensity  for  SAR86  target 
strains  present  in  Monterey  Bay  samples  (as  in  Figure  4a).  EB000_45B06  is 
ubiquitous  in  shallow  samples,  (b)  Tukey  biweight  intensity  for  the  SAR86-II  target 
EB000_45B06  (as  in  Figure  8).  By  this  index  alone,  subpopulations  are  not  strongly 
evident,  (c)  Pair-wise  Pearson  correlations  of  the  signal  pattern  across  the 
EB000_45B06  probset,  between  every  sample  in  which  it  occurred.  Samples  are 
clustered  based  on  similarity  of  probset  pattern  (assessed  by  Pearson  correlation). 
Four  major  clusters  of  samples  are  present,  delineated  by  black  dashed  lines, 
evident  in  both  the  clustering  patterns  and  in  the  matrix  diagonal.  Red  Indicates  high 
Pearson  correlation,  white  is  intermediate,  blue  is  low. 
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Figure  10.  Sample  oceanographic  context  Panels  show  nitrate,  nitrite,  silicate  and 
phosphate  concentrations  through  the  sampling  period.  Black  diamonds  denote 
samples  hybridized  to  array.  Blue  arrows  at  top  of  each  panel  indicate  samples  whose 
Om  array  profiles  were  particularly  intense.  Red  arrows  at  bottom  of  panels  indicate 
200m  samples  whose  variability  was  correlated  to  silicate  and  phoshphate  (Figure  7). 
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Al|>(li>|lllll<n>lldl  t-llil 
A1  tdiid 

AJpPi  ^pinkii.ihflr  rwid 
AlplMpi  u(vi.>l)d4.t.ui  Id 
Alpliapi  i,Hcol>di.t.ci  Id 

Aljilld||i|lt-*llldl  1-1  III 
_Attll  1 .1  (IK  it  -I  lll^  t  -I  .  I 
AJpl>  i^iiVi'.'^hJH-rpriA 
^  AIpK^pi'iVd.-ihArtoi  i;i 
,  Alptidpi  iXcubiKlVI  Id 

AJpJiiipi  u(«<jbdi.lv<  Id 
Alplidpi  uly'jiwi-UM  Id 
Al(itld|ll!i»-il(jdl  [•d  1,1 
Al|4ld|lll>lMl|MI  l-l!il 


,  Dum  ciiiii- 
,  Iiiit.illv-  SAkt  I 
,  fiiiliri 
,  OMTI 
,  I^M/b 

,  I'di vdtdiyuldivi 

,  H1ii/tit>iiii»»>i 
:  K1ii/r4ii.il-s 

^  RhiyrthiAiAA  r>Aivlb/*t-iHiirri 

,  Rhi7<-.hi3i-^i  n  xifi-' 

.  l<lK-yciLidi.U*i  al«i 
,  klKnJyLidi.l,cidt<i 
,  RlKibubdcUidlt^ 

,  Kli<iili>l»- I -I  i>i-s 

,  MIkiiIdIi-i-kiI-k 


Jt  ISMIV 
‘‘r  i>mi. 


>|i. 


nrcotHii  .iKii  PH.iiiiiM 


Prril«v>hA<  tr'rld  ,  Alpbdpi'<T»'’v»h.%rtfMid  ,  Rh>'3rlor.A-'r'''f ah's 
I'luly-ubdclyiid  ,  Al|.4idpi  ulviAMylvi  Id  ,  KIh'^uUii*.  l«i  divi 
I'l  uli.'ulid<.ldi  Id  ,  At|.4idt>i  vjlyoikM.li.Hid  .  Klio(>oLioi.lyi  dloi 
Pi  iilMilin*  t-i1rt  j  Al|4id|ii  iitaHilidi  I— i  l.i  ,  Hlmitiiliti*  l-iiil*-! 
PiillMihtHl-ild  ^  Al|ilid|ii  Dla-illdl  l-l  Id  I  Hlnii)iil)ii<  l-l,ll•-i 


riSii7??7o  hdrmiKldo^ls 'iTi-rjAni 

Atki"  1  0 1 Ofiflrt 1 3  R-K«>vdrlii<  «p  TMim*: 

'.1'000031  ^b'tk.i£iiK{C»  ptXTic^Or' c>ii- J 

(loS  jlilv7bJViUi.ild  >ly»ldld  fcj/ 

CllkSklli)  ’SiHI»dUfit-i  -Hi.tC 

(  INSkll.'  '■ihUHiiUii  IM  <•(1.  MA.S  H  1 

AviiRi'i'n  'rrviVRO  1 3111  to? 

Rii^diiwiii*:_l  iTr  r/OOl 
ftriiPoy.irHK  TMI-H*; 

S^p'ji  nei  wy  >  J 

i_i1dtdld_fc_3  / 

SM«iliH).t.O<l“il  to 
_S  iM  lull. 11  I.m_NA'.1-I_I 
rnORO  L03Rl'i02 

nrorr>ntiA.-Pdr1a 

PriitfyipAcP.Tla 

I'loievUdylei  Id 
I'ltXyuLidclvi  Id 

PiiMMili.nl-ilfi 
PlIllMlIl.l.  iMld 

f*nirMib.vti->r1a 

,  AJph.ip'<iP<>'iha<‘PAna  ,  Hhr.ilnlwPprAio^ 

,  Alphspi  nPi'-'ihirtwia  ,  RlwV>hAi-''Aral<K 
,  Al(.iidpK.<eyl»d<.t.eKd  ,  WxjOuUicteidiei 
,  Alii  1  dpi  uivobdclyi  td  ,  KlKjdut.>i*(.(.i:idk*i 
,  Alfili.iPi  .l-iili.MiM-.i  :  RIkmIiiIi.hI-i.iI-h 
,  Allii.iP'  .I-i.Ii.h  iMid  ;  KImkIiiIi.h  l-iii-.-. 

,  AlphipKiff'iiharr<H-a  piitAfivr  Bbo-iorai-Porali--. 

^  * 

r)Sii733ft7 

AATRfllilOilOOO  fth<vlrtPdrtdrAld<  HTrC3?’'5 

IH/A  IliOieoUia.'i  HfCLilbO 

III  liooso 

Ml  liiill-M  )l|  /!>  IO|-H)S 

ill  1 . .  jll  KHUlJllf  Ih 

III  111111-4-  HIIMItll  IIK  Itl 

ilphd,iriT22*:'i 

KhuOobdUei  olyi_M  tLC/l'jU 
MHX)HJ_04M/1  ” 
lilt  Ml  7(1  lOIXlA 

HI  AOIKMUI  IP 

JH  IKlJ^TltMllH 

rVorpHiP/w-Pr'ii.i 

h  ulcubdclei  Id 
I'l  UlyvbkK.ii;l  Id 

PlIllMlIhH  l-llH 
PiiiImiIi,kI-iU 
PiiitMitini  1-rlH 

Aiphapi  ntiwbarPM-ia  .  RNirtonai^orAios  OMH.''  -•laiiA 
,  Alpiidpi (Aeubdclyi Id  ,  RliydoLtocU- idles  ,  kCA  dude 

1  AlplidpiiAeobdcivi Id  ,  HlKjduboclei dies  ,  keseubd-.Ui  llki.- 
,  AJiilia|M  1  ii-ohr><  1*H Id  ;  MPiodijrmx i-idi-s  ,  koHMiti.Bl-i  ik- 
,  Al|>)id|M [■(•MiliHi l-i Id  ;  klvnliiliix  l-idl-H  ,  knsMilim  Iih  Mk- 
,  /Utiliatri(ir-iilid«  l»M  ki  ;  HIxiiiotifB  i -i  .ii-s  .  '^Ak1ii3 

n-hinivi 

,rrvHyi  I'lni  1 

_J>no3'irm_"  ^ 

PrnI  rHihJ*^;^rorl.i 

.  Aiph.ii»irit<'r)harP(Hia  1  RI>-)dobarP<'r-ilk<' 

in-h'Hiv’ 

Iir70  AIKfHi 

Mrnn7o_iiK:-iii 

PriilPOhJif-pf'rla 

,  AlphapKTlflfiharfi'ria  ,  Rhodo<;plrlllildi; 

II  lioo»y  Hb300_01Ul4 

-II  lioo^y  HI-400U_,.'4Mf.)J 

ml . .  V  <')7AIIm 

III  hnilH-  ^lll  I)ll/(l_ll?r07 

AAtA/OOflOOnofl  Poli^glhdftdr  iihkiiir  l1TCCl!X13 

Cl'y-XAJBA  ^Pf:lagib<n.Kr  ubfquc  HTCCi0i>2 

in  booiu  He4oooj;3a4]uo'x:ia 

III  liiiiiH-  _ tniiMiKi  iNiiit* 

HPlj/UUJ/lUlA 

MFauy0_/4MO3 

1 1  1110  lii/Aiio 
jll0ll/0_()3ll)/ 

,Polagib4rlrH-_irrrri002 

I'yldpibdcloi^H  ttC  11K»2 

HP  4ijyij_3{l4_w.h:  IB 

IIIMOIOJNOIP 

I'l  OtL-ubiKlUIld 
I'l  otL-ubdclyi  Id 
PlIllMlIlill  I-IM 

Pi  iilMiliiii.l-ilii 

Priitc-nbarPorla 

I'l  ulcubdclyi  w 
I'l  olewbdctcild 
PiiiImiIi.hI-i  In 

,  Alplidpiijt.yobtH.t.yiid  ,  kivjdusplillidk’i 
,  Alpli  jpi  vAeobdctei  Id  ,  IOkkJusi'ii  llldles 
,  Al|iiiii|ll  iil-ri||.M  l-Hd  ,  |tlXKkl<>|>lilllijl-s  ,  OM  /S 
,  Al|iiiii|iiiil-tilhii  I.Mld  ,  Ilk  k-ltsl.il-s 

,  AlphapcoPdohafPrHia  ;  Rkkottslaln?  ;  SARU  dado  ,  subgroup  la 

,  Aiplidpiu(.votkM.tviid  1  KickcUwdks  ,  bAUll  .Iddc  ,  iubgiciup  Id 
,  Alpii.jpiuicobdi.lcrid  ,  (ikkcttdiaitts  ,  SAKll  claciv 
^  Al|ilia|iMit-iili<MtM  Id  -,  Hk  k-llslales  ;  SAk  1  1  i.biil- 

in-hn.iv’ 

.liro77l>  17003 

_ nro  770  1700 3 

ProrrHihAi-lJ'ila 

^  Alpbapror-or^crorla  jj^RkkPtt<Ja|o«i^  SARI  1  riado  ^  PolaQihartor 

AY4'jBo33 

AYAwe.!/ 

.n-h.iiiv' 

|nr4rKVi^i7i‘ 1  fi 

eB3iO_llt01 

[tn7SO_03HOS 
[i  f  100*0.11111  i 
rnoMjfl-iroi 

Mrioor)_i7'r  fo 

bBA(./'dU_llE01 

riuv  102110^ 

'll  Ilio^k-UMI.I 

rnosoj  fldr-Ti 

Priilpohrypr’rta 
I'l  lAcubdctcrta 
2 

Pi  ulwbdclLi  Id 
2 

PiiitMiii.K  r-ild 

P  mP»Hih  rl.i 

,  AlphaprrtPPoharPorla^  Rli-kott-slalo<  ^  SARi  i  riado 
,  Alpiiaptoicobdclorid  ,  Kkkettwdk's  ,  bAKll  daOc  ,  iAkll  ,  subviuup 

,  Alpiidpi  uivobdcloid  ,  RrckcVUridlys  ,  bAkll  ddOu  ,  SAKIl  .  subgiuup 

,  At(ili.i|iii<lMilidi  liH  Id  .  milts  iliixkivilMhi 
^Alpb  apirAontiafteria  ^Rtvtooha.-rof 

r)Q30';3io 
AttA-WMil 
AVO/I-JB'J 
AY/ 14  )kk 

Ml  lk)ll--3 

Anrfii^”  34C04 
jkS0t/<._0UU04 
|VHACi  wJ^ibUO'j 
-HAl.lMtJ)3<;i  1 

Anrros~34ro4 

_bbW>ULH34 

'tbACn.HJ_2‘jU'Jb 

I  0i-i)_0'31.n  1 

Pr«itryi»iAct<-'rta 

l'iutyut;<>clt;i  lo 
I'luU-ubtH-U'lld 
Pii)lHilin<.I-ild 

AlphapiTitoiibafPoria  ,  RrKonha.-t'^-r 
^  Al^idpKXvybdCtvi  Id  ,  HiAS-ubactyl  like  ;by  Os-st  bLAlsI  '  its} 

,  Alpli dpi  uieybiKlvria  ,  l<usogbayli.>i  like  bdcU-ilii  puf/bcH 
,  AlHidtiKit-fili.KtMid  ,  SAkI  |(>  ,  iiiildtiv-  sAltl  It)  1 

111  ll>  i  11  33 

HI  111)1  iMUl  23 

Pll>tM>()ll<1-ll<1 

^  Al(illd|tiiAMilliKtMla  ;  SAKI  ]^t> 

III  Ikiiin— 

jll/l»_l/004 _ _ 

_jiijir)/o2i /IMM _ 

'piiilMihii<.i-ild 

AllJlliltll^MlIhll  t«H  til  ,  SAkI  Ih 

1  n  ■  hni  1  v' 

*lir0070_0'5133_ 

i»ron7n_oii3? 

ProrrHibarPorta 

^AlphaprriPrnharPrHHa  ;  SARI  1-. 

III  llllllV- 

'iirnd7o_i4Ai? 

.Hm0/0_34tll 

'fcBAC_40LXI/ 

1  llOIK)_  J/<aN 

^fr^frtfO<WCftHjitord/«  JJ  Kj<  j^bkl 

*iirori7n  iiai? 

’HPU03U_J4tll 

tbyjy^Atxw/ 

1  001)0  :i /coo 

_ l^llllBdllHjIU  l_/Sk-1 

PfOPfHiPll*i-P‘Tla 

I’l  Ut.CHjbvK.U'1  Id 
1*1  ulcHrLkM.lt.-i  hi 
PlIltMlIl.Ml-ll.l 
Pi  iKmiIiiM  I-i  Id 

,  AlphaprriToohacPona  ,  SAP  1 16 
,  Al|jlidpiuivotkK.leria  ,  bAKll'd 
,  AlptidpiiAouWlufid  .  SAKIIU  ,  SAHllO  1 
,  Al|itinpriil-oli-KlMli)  ,  SAkI  lb  j  SARI  10  II 
,  AllilidiiKit-iilidi  t-ik«  ,  Splitm^miiii'mldl-s 

in-hiMiv> 

itro'’Hio_34ni3 

iirowi^^/Anp 

Pr<itry>PafP<'r1a 
r  lYThriibarPer 

,  AIpbapKiPoohafTeHia  .  Sptiin<)«irrvonarlalo<  .  TrylPHohartotaroar  , 

II  liooic 
tf0«9400 
CPO0O316 

In  l>ou«# 

;HHO_30A33 

’Mm6lU_JiJA23 

I'l  OtCUbdcU'Hd 

.  Al^idpiuteobdcteiid  ,  I3l_112  -^IdOe 

6BO00_41B09 

Potaromonai  sp.  JS666  drofl 
HF4000_05M23 

Not  yet  valldlv  described,  OM43  dade 
HTCC2I«I 

tB041B09 

PoiaroniDnas_JSG66 

HF4000_05M23 

MethylophlLales  PfTCC21Bl 

Proteobacterla 

Ptoteobacicfla 

Proieobactefla 

PrnteohArPerlA 

,  Betaptxiieobactefia 

,  Betapiuieobacteria  ,  BufVhoklat  tales  ,  Potaromonas 
;  BetaproieobacteOa  \  Surkhoklerlales  ,  DatfOa 

•  Betaproieobacterla  ;  MettiytophHalos 

Ay45e64S 

In-house 

AY4S8647 

In-house 

EB080_L12H07 

HF130_04F21 

fcBOO0_3bAO7 

HFOOiP  04H24 

eB80L12M07 

HF0130_04F21 

t»000_3bA07 

HrOOIO_04H24  _ 

Proteobacterla 

Pfuteobacterld 

Proieobacterla 

Proieohncterla 

;  BetapiYiteobacterta  ;  Mtrotomonai 
,  BetapicAeobacteria  ,  0H156 
,  Betaproteobacteria  ;  OM43 

;  BetaproieobacteOa  ;  Rhodocydales ;  Rhodocyclaceae  ;  Zoogloea 

'll  liiiii-- 

Tnii)ti:ii  I,  s 

HIOOIII  lOIOS 

PllllMllli*.l-lld 

,  ll-lld/41>Sltllll  HIllHlIVMkHIS  ,  D-ttiipllllMlh.ll-l-ll.l  ,  SAR.I31 

Hi  h.-iii-w' 

|iirix>7n  n7rio 

nroo7o  nyrio 

ProtrHihai't.'rla 

.  i1Hra/---nsikin  siibrtivisk-iiv;  ,  noltapror<^nhart-  ila  .  SAR13-1 

iii  h-->ii<--' 

lim070_nfX3l  _ 

iirpo7n_i'in3i 

_  Prnfoohai't-Tla 

-lolta/npsUnii  SI ihdiviO-ins  .  Doltapr<it<~-nhai ^l^•'l1a  ,  SARI, ’-I 

>11  ll-JOVU 

VuilJUJ/SOtJ'J 

HPU130_bjOU'.l 

I'l  uicHjboctyiid 

,  ik.4td/v-pjik.-ii  sHibOivisiLHii  ,  Lktldpiuieut'actyihi  ,  bAI<324 

HI  I100-.K: 

jHFUUIj“2lJi/4 

HHJ130_20J24 

I'l  yUHiUtKlyi  Id 

^  tk-hd^-.-piik-ii  tdjbOivisKHii  ^Lklldpiof.-ubdcleii.i  ^  bAK3/4 

III  llllllS- 

|l II 111410  HDIIJ 

'iiin/tio^Hni  i 

PlIllMIll.lll-l  lit 

,  it-l1.)/-)isll<>ii  HiilHlIvIskHiH  ."(Ml.ipiiilMihiii  l-il'l  ,  SAM. 134 

i|i  hiiii'-^ 

'lllO/i«l '  ikl  3  1 

’11111300  .(ki  / ) 

PiiiimiIi.mI-i  I.i 

,  itt4lii/»^(Klkiii  Hulntlvlskil  IS  j-  i)-llii|iiiilMili.K  l-i  III  ,  SAI(.t34 

Hi-  llOIIU- 

llir/oo  1OM30 

’iiro300_'10N30_ 

ProtOiiha.'ti-'rla 

^  dolra^r.jpsikin  s« iMiviskins  r)oira£>rritr-r.hart.->r1a  ,  sAR'-.'*  i 

10  h.iiiu- 

]i  irosiiomAoi 

'iifn'nn_oiAn-i 

Priifr-nharPi-'ila 

.-1f'lra/.’P<;ito<i  SI  ihdivi'vkins  ^  noltapmtr-oharl'-'da  .  SAR134 
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Tabic  1  continued 

'  ■■  ■  H 

nr  rni  ■"V'n  ’o 

1  ■  .  r.'f  ' 

i|i  •- 7  ffnlSN  1  s 

„  '1|»|.U- 

'  1'  1  iii^i  «  11 

'll  ;i  iin_i'ff  'll 

■  l-flH- 

i-  lll'J  .MR 

■11  ‘ni.jifir-i 

-f.  llj- 

•iHiHK.. 

•Hjh 

•(•  M  IJIV. 

Hi-  il.'ii  MltJli' 

m-hofiw 

UFOOlO  0R8O7 

NFOO  10.08807 

AYASRbV) 

FR7S0  01B07 

FB7V).blB07 

In  hotibA 

HF  2ll0_07fiJ0 

Mr02OO.07GH) 

in  hvoba 

HF130_12L15 

MF0130_12L15 

in  hoci-ba 

MFIU  OUIO 

HFOOIOOIJIO 

in  iKAiba 

HF70  10102 

MFCK)70  10102 

AVaSBOll 

FB7'»0  01B07 

FR7S0  03IW? 

In-boijsa 

HFO5OO_0IL07 

HF0500,OIL02 

In  PouAa 

HH0_U1011 

HF0010_1»013 

III  IHHIM! 

HF4O00  22B16 

HF4000.22B16 

Og2b74'35 

OaapAnl_lH2 

0q2ft  74  95_DaapAnl^lF  1 2 

DQ26749t) 

OaapAnt  32C6 

DQ 267496 .  OaapAnt.  3216 

Erin6972 

FB0B0.170FO4 

FB0R0.L?0rO4 

[•'M.-*,!  ♦■I.'I  A  ^  iif.i ,  ,  sA.R  l  ’  i 

(•i  ■tl•.l^vtc'1  *  ^  .liW'  »'ojn.lrtr'  rvim-«-ir"i>fni  r^TiJi  ^  sARfl 

,1  '  . Ml, 1.(1',  M, !•,(!,:-. «i,  )  ♦•II  ii.i.itKtithH  >iH  '  '.AHI-'t 

I  dHii!'..,,  tni  -kl'.l'H^is  l-l'  -.uUI",  i  ♦•II  nil  '  HI  M  ,  './VW  I 

I**  . . .  „  ,»h|i,|  •H^I^.|.^«I  M,UtlVWlHl-.  ;  hIi  |(•I.^H.•^|.4, 'HIM  ,  '.AM  I '■?  I  I  IviHI 

liy  N|'>A.-'1IM  •». 

!'■  •t.T.i  .1. 1.n  ,1  I  VJ...V',I.  .Ms  -  .  n.'i  M  s4mi;.:,i,  ■ 


Prot«>ba<t«irtA  .  4eirA/«psrk>n  suhrtfvmoos  ;  CXtAproteohActcria 

Protaobartana  ,  *uhdlvl«irtr«  ;  OHtA()(T>tMrui(rtcnA 

ProtfwbaitMlA  ,  <lAKA/wp«ikin  ^ubdlvlAlom;  Dob  Apr  otwibAct  aria  ;  DtxulIntMctarAii^^  . 

Nlti(Mpin«  Mkit 

ProtnobAftwiA  ;  iIvKa/iipiiilon  AubdlvlAinns;  MtApiotaobAttmtA  ;  MvxotCK.cA<a5  ; 
E4«FllcD  «.UOe 

ProteobdclMla  ,  ^Ha/«pwk>n  «ubdlvi5torK,  DWtoproteobaclerta  ,  0M2/ 

PrvtMjbacUf  la  ,  «>clW«p>ilo<'  suMivistvirs,  Dwttapi (Xwvbactvrla  ,  WZ7 
ProtAobATtartA  ,  ilattA/ap<ik>n  viMIvHlons;  DalfApitrfaobArraftA  .  SAAaOb 

ProtaobATtena  ,  ^etta/apsXon  subdtvKioof ;  OaKaproteobActabA  .  SAA40b  c1u«w 
PiotwobMtMiA  ,  italtA/apaMoo  «iilxllvi«tor>s;  DabapiotiPObAttafla  ,  SAA40b  rKiala* 
AailOORclaba 

ProtiMba<  tCfia  ,  daba/npaHon  sutkllviainoa,  D«A apt olaobAtl aria  ,  SAR40P  lAistn 

ESP200  KIO  Lladc 

Proteubactai  la  ,  Dallapf  otcobactaiia  ; 

Protvobactaiia  ,  OpHaprotaobacter^  , 

Profaobartaba  ,  DaltApraTaobACtabA  ;  NitrospInACPaa 


ln-bo<jsa 

EFIOO_9lt111 

EF100_93H11 

in-bottba 

HF?00_40H?? 

HF0200  40H22 

Mot  yat  valklly  dascilbad  TW  7 

Altai  i>nx)nAiJAla3_TW. 

CH959J02 

Ait^romon^s  iruKl*o&t  Oaap  acotypa 

A_mAClaodll_Oaap 

AAOHOl  000000 

P»aM7cia/(ar(VTxina>  lunrcaU  D2 

Pb«udo  UinkAta.D2 

tn-lHAiba 

MF4000_16C08 

HF4000_lw:08 

AE01/i40 

trJmfnartna  torfwens/s  LJTR 

1  lolbienAlf .  L2TR 

Dq795?37 

Anrfo«  .04003 

Antfm  04003 

AY4SR64A 

FBOR0_t3lf09 

Ffi0«0_L3lF09 

Mitrococcu*  mobifis  Nh  231 

N,mnblll«_Mb_23ll. 

Protanbartaria  ,  GammaprDtaobAi'taria  .  AGG47 
ProtaobArteria  ,  GAmmAprotaobActana  ,  AGG47 
Protaobaitada  ;  OAmmaprolaobActai  la  ,  AltMitrooAilalas 
Protaobactcfia  ,  Gammapf  otaobattafia  ,  AltaromorwibalaA 

Pi  oUraba^tar  la  ,  GammApfotaobactai  la  ,  AltcrpiTwrvaiialas 
ProtaobAclffia  ,  GAmmApro(«pba(.t«i  la  ,  Altai  omonaUaln 
Protaobactat  la  ,  Carnmaprotaobacterla  .  AKaiomonadaki 
ProtaobATtarla  ,  GAmmAprDtaobAftaba  Aiilcnc<)GO'  16 
ProtaohartariA  ,  GAmnnAprfttaobAftaria  AUmCbblVn- 1*1  clarla 
Protaoharteria  ,  GAmrrutprotaobAi'tarta  Chromaliala* 


In  h<Mi«a 
In  hoii«a 
Jtfi  J64U063 
111  huuvr 
In-  ho»i4a 
in-ho«i*« 
In-bwi^a 
In-boosa 

in 

in  bouK 
In  hooia 


HI  ?I»0_4IF04 
Hh0;7O_27E13 
EBOUO  6^AU 
HFIU  0IE20 
HFIf)  I6W; 

Hr7n_owx)7 

EB0n0.37rO4 

HF4l)00_3<illU 

f*Minomon*s  Ap.  MED121 

MF7n_21F0« 

MF4POO_13G19 

HF1J0_2SG24 

HF4U00  2301^ 

Il-IH-  I  I'fVV, 


MF0200_4tFM 

mfo;7o_?;ei3 
taopo  65A11 
HFOUIO  01E20 
HFOblO  lAMIS 
HF0b70_0«CK)7 
EB0O037rn4 
HF4iX)0_3bl  lU 
Mai  lnomona»_MtD  1 7 1 
HF0i)7O_21FO8 
NF4000_ 13019 
MF0130_2iG24 
MF4000. 23015 
ni>8')jji'ni'  I 
iinoiii_,iMu  .• 

'  ir 'I  )h_oir^.i 


In  rwuAA  HFOVX).  17004 

AY4b8637  bS750_07H09 

in  MF4000  19M20 


HFOSOO_  12004 
EB750  02M09 
MF4000  19M20 


ir  I  1-.  »n?i 


At  •.  I.  tb  IV  JIA 

A  .j-4  te  II  4  p  .. 

ri-  .  tl  1  '.u  :tt  KI  .J,  I  n- 

■  P.M  '■’I.-  ' 

BrOAn  »l  J';7B<V  HFOOlO  0V14 

ln-hoi»s«  HF4000^23114 

AY/4<H9n  allAI  i«IJI/01 1 

<  PoniMiyij  vthnp  ««-bar  I  s !  H 

I  1I/74I7.1  Vlliilo  S|ila».i|tilii*,  17IKII 

(.HiTOiOOU  VIIaiw  bp  MtU277 

N.'A  Vll^iio  bp  ibMAI  3 


'll  11:  1  -in 


,ct*  Umiik 

tb  IHJ  4.P 
ib  WI  ■l.l'O* 

■flAr  ■’'i'hT’i 

HFOnt0_0SfI4 

Hr4n00.23Ll4 _ 

I  llA4  li^.l_0/I)l  1 
'yihiioJI^iliailJfSIM 
V.bplal>^l^^l^|^_l. .>111)1 
Vlblio  Mf0727 
VIlMio  SWA  I  3 


ProtiHjbAi.tm  lA  ;  Gairfnapiotaobm.taria  , 
Protaobactaila  ;  GammaprDtaobAitaria  , 
Protaobactea  a  ,  GArnfnapn}<aobai.tana 
ProtvubActw  14  ,  GAmmapn)(.aobA*.tri  la  , 
ProtaobArtariA  ,  GamnnApriotaohArtafiA  , 
Protaobarlar  a  ;  CAmmaprotaobArtarla  , 
Prolaobartaiia  ;  CAinmAprotaobArtarla 
Protaobai tafia  ;  GAfnmnprotaobActailA  , 
Protaobattafia  ;  GunmAprotaohAct  an  a  , 
ProtMUbartcfla  ;  GamnnAprDtaobactai  la  , 
Protwobai.taf  la  ,  GainniApnitaoba«.t«i  la  , 
Pi  otaolMictat  la  ,  Gammaprotaobactai  la  , 
ProUfoba^laria  .  CanwnAprptaobACtena  , 
Pi  f>tiN>h^t  tiH  1  ^  1 1  nnrn^Uir.vitiA-'tof  ,b  , 
Pi  ntMItiAflil 'A  ,  r,  ♦Pim^u4.M.A.‘1ni -A 
pi  itmUAi  lPi  A  ^  i%mtr>A|v  1 4 ■■‘pr.A.t.'.-iA 
Pi  .cl♦Hllllll  Ihi -H  ,  v,,iiiiimn)i  iiiiAilKli  IMI  III 


ChromatlalM  ;  Blvalva  aniloAYniblonl  itaila 

OHB  2CIu«Iik 

bBOOO  .63A11  clada 

KieVA  clada 

KiaOA  clAda 

K189A  clAda 

Kiel  119  Clada 

P4  clusttN  (cloMi  to  OMbO  chiAtaf ) 
OcaanofpirlllAlaA 

Ocaanotpli  'llalai  ;  Alcantvoi  ax  Ilka 
Ocaanoapli  iNalaa  ,  AlcanIvoiaK  lika 
Ocaaiiospa  illalaf  ;  OM182  clada 
Ocaanoapalllalab  ,  OM182  dada 

'.^••IfiO 

r'MtiO 

_ 

I'Mbll  'Imla 


P' .Hh-iImi  Ihi  .1  v;-t<'>li.<ijii  |4-M|.(f.  Ihi 'll  -.‘MCij  <  P-lfx)  IhiIh 
ProtaobActarla  ,  GammAprotaobACtana  ;  Paaodomonadalab  ;  PMHidomona^ 
ProtaobactariA  ,  CAmmaprotaobaciana  ,  SARI 5b 

ProteobAclaiia  ,  GanHnApfptaobActai ia  ,  SARIX)  Oubtai  ,  EB750_02M09 clada 

-r«sir.»ria( '  ♦  •,  ♦'nrnApr-.f.vr.A.  Ii‘i  ‘nbR.'n*  I  .iIh  ■  itAP  '  ’  |.♦.^H 

p.  ♦  i,»'nfn;s|>'T4.»-PA.  Ih  a  ■nAP'«i.  1  .tiHi  iiAP  '  '  ,  I.v1h 

>  'ti-H I m  X  -♦••'n-Ajii  .t-'  PaIi  a  ■sab Ai^  'll  ^l,  f  iiAb  1  '  lA ‘I 

,  1h<i*  -'I  !«  .  1 1  ♦  l■lllt^^>^|.^Hll|.,l,  Ih’ 1,1  '.AKHr>  •  li  *IHI  •.AKi4i|.iih 

P-  -4».i ii  .1  ( HI  -  .  I'liirtjxMH-t  ii.  Ih  ,1  -.AKHn 


sAKal#  .. 
'.AK?li  ,, 
lAK  «'  .1 

^AFi|. 


Protaobactada  ;  GAmmAprobaobActarla  ,  SAR92 

Protaobactatia  .  CamnAaprotaobActarla  ;  Thlotrlcalat ,  ZD04n^  clada 

Pii»t««il>.ri,taf  '*i  ,  lH»i!.iiiripri4iK>lim  lai  Ui  ,  |i>i<4liva  Vlt)ilr)fvilt»s 

l>ii)lMihA<tHll4_;.(VAmiti4p<iitaolMKl«filrl  ^  Vlbi  lOfMhai  ^  Vibilc 

Pf otmibAcIm la  ;  (lftrniiwairM«iolMi(ti«lii  ,  VIbi lofiAlwx  ,  VHnlii  ,  bpkiMlIfliiH  ijirMip 

Pt iXvul>actaf  lA  ,  Ga<iiniapic4«vlHM.ta'id  ,  VHx  lonaKrb  ,  VIbiK;  ,  bpkndidub  piuup 

pi otccH^acIvf  la  ,  GainniApfUtaoUKtana  .  vAIxtonalvb  ,  VKato  ,  bpkndidwb  piwip 


In-nraiu- 


iir in  77r.n 


iirontn_22r2i 


PribtaohArtaf'A  .  GAmfnA|VT4i*obAi-tanA  ,  VlbrlonAlf-c  ,  Vlbrlc/PhofnbAflarlitm’ llko 


1  R  IVIMI 

PbototHhitAnifn  profijf^ijfr>  SS9 

PtMHi>l>mlf«l|ini_SS9 

m  fxBHa 

EB0R0J93H08 

£B750_07C09  “ 

In  hofjM 

E8080Il9  311118 

‘E6680_L93H08 

0QCI68067 

Me013K09 

00068067  MED1JK09 

AY4 58640 

EB000_65t>02 

EB0OO_65t)O7 

AY458643 

EBOOO  47H08 

EBOOO  4;h08 

AY3724S3 

ANT  32C12 

ANT12C12 

AV1724S2 

ANT.RCIO 

ANTRCin 

in-honsa 

HFO‘SOO.O<1B09 _ _  _ _ _ 

HF0500j3«ei»_  _ 

■R  .■Il_['*h|:  lAflH 

1  )S/i>n^HiN 

IH^I  nHHftfi 

Bl  III  I'HiAf 

jKiObHIM>«_KI  in  /non 

i  Hji.n-  •  bb 

Vt>MI  1 

i:KJt>n5/55”f>bAll  1 

LV7I,'»»«4  / 

••U-d'.bAL4oA0o 

U0088H4  /"McbJc'BAC.4oAao 

UI^I-  '  >■  i  '  '.a.' 

Med-.iJAf,  SJHO 

OOO/i/'Hi  Madc»}A<.87Hd 

•■k-di-bAt.  '30 

00,0  ■'  '4^  i.FK-dcb  AC  3-4<.  VO 

•M.-bM-flAi'  1'A'Ofi 

r»0n  7  7 '1  '1_  Mn.1r.nAr  -i  I V  PR 

Pi  oltnrbai  Ira  m  ^  <Ml<nifi«ipi  (4f^)bn(  Im  m  ,  Vlbi  miMlfai  ^  PboliibAi  tMlliri)  |W  iitiirKbifn 

Protaobactai  )a  ,  OammAprotaobACtana  ,  700408 

Protaobat.'tef  la  ,  GammapfDtaobActai  la  ;  ZD04 1 7 

Protaobactafia  ,  putativa  Canifnapiobrobactiftia 

ProtaobaclafiA 

Protaobactai  la 

ProtanhActadA 

ProtaohAftadA 

Spkochaatat ;  SpirocbaatAla*  ^  RhteobiAias  ,  Spkoebaata 

iinkntviii  ,  i>MSP  ,lM<jiiiil4lHMi,  MinsoliiHVilf- ilt-yi  .1  ilnl  iii  1 

iinklii'invii  ,  PR  II  hI|Pi>>|>"iI>«4i,'M  MIaI  IAI  .  IihIi' 
uiikiiv*vii  ,  IK  II  AlptMpf  .'tculiiKtn Kil  111  '.link 
vfiikiR'Wii  .  PR  fi  alplMpf-ili.’ubjcteoiil  I’M  i.ijd*. 
v»"kin.'»vii  ,  I'M  II  j'ptiiipi'.'tvvOiKtCf  uil  I'M  iJixk' 

Hokfif^wn  ,  PR  n  Ai|>nApfr>tfnbArti-4iil  PR  rMFr 


127 


Table  2:  Summary  of  Array  Targets  by  Clade 


#  of  Clones 

#  of  Genomes 

Cluster: 

Targeted 

Targeted 

SAR86 

11 

- 

SAR156 

2 

- 

Thiotrichales 

2 

- 

EBOOO  65A11 

- 

- 

Oceanospirillales 

- 

1 

Alternomondales 

1 

4 

OM60 

2 

2 

NEP17 

1 

- 

OM182 

2 

- 

Alcanivorax 

2 

- 

Psuedomonadales 

2 

- 

Burkholderiales 

2 

1 

Nitrosomonabales 

1 

- 

Methylophilales 

1 

1 

Chromatiales 

2 

1 

DHB-2  unclass  Gamma 

2 

2 

Vibrionales 

- 

5 

Agg47 

1 

- 

K189A 

3 

- 

Rhodobacterales 

6 

18 

T31-112  unclass  Alpha 

1 

- 

Rhizobiales 

2 

2 

Shingomonadales 

1 

- 

SARll 

6 

2 

Rhodopirillales 

2 

- 

SAR116 

8 

- 

unclass.  Alphas 

3 

- 

OM75 

3 

- 

unclass.  Alphas 

4 

1 

SAR324  deltas 

17 

- 

OM27  deltas 

2 

- 

DeepAnt lF12  deltas 

3 

- 

Shingibacteriales 

5 

1 

Flavobacteriales 

2 

8 

Verrumicrobiales 

5 

- 

Planctomycetales 

3 

2 

SAR406 

4 

- 

EF100-108A04 

3 

- 

Gemma  tinomondales 

1 

- 

Acidobacterales 

4 

2 

Desulfobacterales 

3 

- 

SAR202 

3 

- 

Chloroflexi 

3 

- 

Prochlorafes 

1 

16 

Microthrix 

5 

- 

Lentisphaerales 

- 

1 

GII  Arch 

4 

- 

other  Euks 

2 

- 

GI  Arch 

3 

- 

no  16S: 

50 
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Figures  S1-S5.  Phylogenetic  trees  illustrating  the  relationship  of  SSU  rRNA  gene  sequences 
from  genomes  and  uncultivated  clones  represented  on  the  genome-proxy  microarray  (blue)  and 
their  close  relatives  (black)  as  “landmarks”.  Support  for  dendrogram  topologies  is  indicated  by 
bootstrap  values  at  nodes  determined  by  the  maximum  likelihood  method  (only  values  >50  are 
shown).  The  outgroups  used  were  Methanomethylovorans  victoriae  strain  TM  (AJ276437)  for 
the  bacterial  dendrograms,  and  Myxococcus  xanthus  strain  UCDaVI  (AY724797)  for  the 
archaeal  dendrogram  *The  publicly-available  SSU  rONA  sequence  for  the  Roseobacter-Wke 
alphaproteobacterial  clone  HTCC2255  (AATR0 1000062)  is  from  a  Gammaproteobacterium. 

S1.  Gamma-  and  Betaproteobacteria.  S2.  Alphaproteobacteria.  S3.  Deltaproteobacteria  and 
Spirochaetes.  S4.  Other  Bacteria.  S5.  Archaea. 
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Figure  S2.  Alphaproteobacterlal  array  targets  (blue)  and  their  close  “landmark” 
relatives  (black). 
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Figure  S3.  Deltaproteobacterial  and  Spirochaete  array  targets  (blue)  and  their  close 
"landmark”  relatives  (black). 


131 


Uncultured  Verrucomicrobiales  bacleriuTi  HF200  39  LOS  {EU36133S) 
UncLRjrsdVerrucomicrobiJles  bacterium  HF4000  13K17(EU361M2; 
l^wrucornkTobrum  spwojum  (X90515) 

Uncultured  Verrucomicrobiales  bacterium  ^R0  30102  (EU361331) 
fucop/ijius  /^Koida/»o7yt)Cus  (AB07397t} 

^  Uncultured  Verrucomicrobiales  bacterium  HF10  05E02  |EU3613i6| 

Uncultured  Verruccfflicrobiales  bacterium  HF770  11D24  (EU36"340) 

L•n7(ap^a«ra  snimsi  sp.  HTCC21SS  (AY390428) 

Rhodop/nSuli  ba/Jici  SH  1  (8X294149} 

Uncultured  Plre^/uta  done  5N14  (AF029078] 

Uncultured  planctomycete  5H12  (EF391BS4) 


Blistopirellula  marina  DSM  3645  (NZ  AANZ01000001| 

Uncultured  Planclorrycetales  bacterium  HF500  40D21  (EU361151) 
Uncultured  marine  bacterium  159  EBAC750- 03802  |AY458631) 

Uncultured  SAR406ciimer  bacterium  HFtO  18013  |EU361 308) 
Uncultured  lubac^um  SAR406  (U34M3) 

Uncultured  SAR406  baclerium  HF500  01L02  (EU351270} 

Uncultured  SAR406  bacterium  HF4(IOO  22815  (EU361303| 

■  Uncultured  bacterium  HF130  06E03  (EU3S1125I 
Uncultured  bacterium  EF10IM06A04  (AY527378) 

■  Uncultured  bacterium  HF500  1  6016  (EU36'  1201 
Uncultured  marine  bacterium  Ant4CS  (0029 S242) 

UncuRund  Gemm^onadales  bacterium  HF1 30  030C3iEU35113S) 

Genvnetimonas  aurarttiaca  (A607273S} 

Uncultured  Acidobacteriales  bacterium  HF200  23L0S IEU360982) 

Uncultured  Acidobacteriaies  baderium  HF4000  26D02 IEU360996) 

~  Acii(A)(MCfenumcapsu<irum(026171) 


JSSj- 


UncuRured  Acidobacteriales  bacterium  HF7?0,27F21  (EU360993) 

1001 - -  Pobriiwcfer  rroens/i  (14610021 

' —  Polaribaclerdoifdonensls  MED^  52  (00481463) 

-  Uncultured  BictwoWetis  bacterium  SBI2-1 5  P41  A3  (DG272742) 

—  Uncultured  Flavobactenaceae  ckme  bibfl  10.d03(AJ937771| 
Marine  CFB-flroup  bacterium  88FL7  (AF366005I 


Lrfc 


Leeu^enrioeA^fa  blandensrs  MED217  /AANC01 00001 1 1 
‘  DoAdom'a  dongriaensis  ME  0134  (HZ  AAUZ01 000003) 

-  Flavobacleriales  bacteriurr  HtCC2170  'AAOC0100(M)041 

-  Pob<5«MfaiM  btformali  HTCC2501  (AAOlO  1000007) 

-  Crocwbacfer  allant/cus  HTCC2559  (AAVP01000C01) 
Uncultured  marint  bacterium  Ant39E11  (0029  5241) 


-  EB080,LC6E11(NOTPUBl.iSH:D| 

Uncultured  Spbingobactenales  bacterium  |iF10_19H17 1103613141 
UncuRured  S^mMbacterialis  bacterium  HF130  338'9  (EU361316) 


Pedobacfersp.  BAL3d  \HZ  A8CM0 10000341 
Mr  Prodi/OfDCOCCUS  marinus  str.  AS960  ■  (CPOOOSal) 

P-  Procritorwoccus  merinos  str.  MIT  9312 IAF0533981 
W  Prochhmzoccui  merfnus  sir  MIT  95'  5  (CP000552) 

P  Procli/orococcus  marinus  pasioris  str.  CCMP1986  (BX572090) 
ProcAtorococcus  mirinus  marinw  str  CCMP1375  (AEfl17126) 
ProcWofococcus  marinus  sir.  NATl2A  (CP000C95) 

PnocAIOfOCPcais  marinos  str.  MIT  9313  (AF053399; 


•a  Symecriococcus  sp,  CC9902 /CP00009T) 
ttf  Symchocwxus  sp.  BL107  (AATZOOOOOQOI 
9^  SynecAococci/s  sp.  CC9605  (AY  172802) 

J-  Syoecriococcus  sp.  Vilt  8102  (NC  005070) 
n-  Synechococcus  sp.  WH  7805  (AF001478} 

P  SyrrecAococcos  sp.  WH  7803  (AP3t'2911 
*-  Syn«c/»ococcuS5p.CC9311(AY172801i 

—  iynechococcos  sp.  RCC307  (NC  009482) 

-  Cyanortecesp.CCY0110(AAXW1)1000000) 

Uncultured  Chloronexi  bacterium  HF200  06115  (EU361 029) 
Uncultured  Chloroflexi  bacterium  1^4000  2^02(EU381084’i 
Unidentitiid  MibactMtum  SAR202  (11207971 
—  Uncultured  Cbloroflexi  bacterium  SAR292  (AY534092) 
Uncultured  Chloroflexi  bacterium  HfSOO  03M05  (EU361063) 
Uncultured  Chloroltexi  bacterium  HF200  09109 IEU361033) 

-  Uncultured  bacterium  HF130  D6  PI  (80300613) 

~  Uncultured  Chloroflexi  HF770  09Efl3  (EU361081  i 

Uncuhured  actinobicterium  HF200  20K23lEU'361003) 
Uncultured  marine  bacterium  Ant4E12  (DQ295238I 
Uncultured  actinobacterium  HFiOOO  04C13(Eu361018| 
Uncultured  actinobacterium  HF7Q  7F14  <EU361000) 
Candidatus  Aferotfirix  pirvIctSa  sinifl  8)617*2  (DQ147279) 
janibacfersp.  KTCC2S49(AAMNQ  1000002) 

Marine  actinobacterium  PHSC20C1  (AAOB01 000009) 


0.10 


Verrucomicrobiales 

ILentisphaerales 

Planctomycetales 

SAR406  Cluster 

EF100-108A04  Cluster 
Gemmatimonadales 

Acidobacteriales 


Flavobacteriales 

(Bacteroidetes) 


Sphingobacteriales 

(Bacteroidetes) 


Prochlorales 

(Cyanobacteria) 


Chroococcales 

(Cyanobacteria) 

SAR202  Cluster 
(Chloroflexi) 

Other  Unclassified 
Chloroflexi 

Candidatus  Microthrix 
(Actinobacteiia) 

Actinomycetales 

(Actinobacteiia) 


Figure  S4.  Other  bacterial  array  targets  (blue)  and  their  close  “landmark”  relatives 
(black). 
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Figure  S5.  Archaeal  array  targets  (blue)  and  their  close  “landmark”  relatives  (black). 
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Figure  S6.  Origin  of  array  targets  and  their  relative  array-based  occurrences  in 
Monterey  Bay  and  Hawaii  samples,  (a)  Derivation  of  array  targets,  either  as 
environmental  genome  fragments  from  Hawaii  (blue),  Monterey  (green),  other  marine 
sites  (beige),  or  from  marine  microbial  genomes  (black).  The  number  of  targets  in 
each  category  is  indicated,  (b)  The  proportional  abundance  of  each  target  type  in  57 
Monterey  Bay  samples,  (c)  The  proportional  abundance  of  each  target  type  in  4 
Hawaii  samples,  (b)  and  (c)  are  measured  as  the  relative  proportion  of  total  array 
signal  across  all  samples  hybridized.  Targets  derived  from  a  particular  environment 
are  proportionally  more  abundant  in  that  environment. 


134 


Figure  S7.  Array  profiles  for  specific  phylogenetic  groups  of  targets,  (a)  Roseobacter 
(b)  SAR86  (c)  SAR11. 
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Figure  S8.  Mixed  layer  depth  (MLD)  over  the  sampling  period,  with  hybridized 
samples  indicated.  MLD  was  calculated  as  the  first  depth  (^10m)  with  >0.1  deg  C 
difference  from  the  previous  meter  (per  MBARI  BOG  group,  Reiko  Michisaki,  pers. 
comm.)  X-axis  indicates  sampling  date  in  continuous  numbered  days  since  Jan.  01, 
2000,  and  y-axis  indicates  depth.  Dashed  red  line  highlights  30m  depth.  Trendline 
shows  moving  averageof  MLD  with  period  of  2.  The  MLD  at  this  location  is  typically 
deepest  in  the  winters  and  shallowest  toward  the  end  of  the  spring/summer  upwelling 
season.  30m  samples  were  both  within  and  below  the  ML,  and  the  site  shows  high 
MLD  variability 
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Figure  S9.  Population  heterogeneity  in  all  targets  occurring  in  all  three 
pyroseqeunced  samples.  Sample  colors  indicate  depth  and  oceanographic  season, 
as  in  color  sample  legend  in  (a).  Red  stars  next  to  samples  across  the  top  denote  the 
three  pyrosequenced  samples,  (a)  EB000_31A08.  (b)  EB000_39F01.  (c) 
EB000_41B09.  (d)  EB000_55B11.  (e)  EB080_02D08.  (f)  EB080_L08E1 1.  (g) 
EB080_L11F12.  (h).  EB080_L27A02.  (I).  EB080_L43F08. 
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Chapter  4 


Conclusions  and  Future  Directions 
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In  this  thesis,  a  new  tool  for  profiling  marine  microbial  communities  was 
developed  and  validated.  It  was  then  applied  to  study  time-series  ecology  of 
marine  microbial  communities  in  Monterey  Bay,  CA. 

The  prototype  genome  proxy  array  (Chapter  2)  targeted  thirteen 
environmental  genome  fragments  derived  from  Monterey  Bay,  the  Hawaii  Ocean 
Time-series  station  ALOHA,  and  Antarctic  coastal  waters,  as  well  as  three  8-kb 
regions  of  the  Prochlorococcus  MED4  genome.  Multiple  (n=20-60)  70-mer 
oligonucleotide  probes  targeted  each  genome  fragment  or  genome,  and  were 
distributed  along  each  ~40-160kbp  contiguous  genomic  region.  When  hybridized 
to  targets  or  target  relatives,  the  array  correctly  identified  the  presence  or 
absence  of  each,  and  could  discern  related  non-target  genotypes.  It  showed 
minimal  cross-hybridization  to  organisms  with  <~75%  ANI  to  the  targets.  When 
target  cells  and  their  relatives  were  spiked  into  a  background  of  seawater,  the 
array’s  discriminatory  ability  did  not  diminish,  and  the  array-based  organism 
intensity  correlated  linearly  to  cell  numbers  aaoss  six  orders  of  magnitude  (R^  of 
1 .0).  A  related  strain  (86%  ANI  to  target)  also  showed  a  linear  correlation  of 
signal  to  concentration  (R^  of  0.9999).  The  limit  of  detection  in  a  natural 
background  was  0.1%  of  the  community  for  targeted  genotypes,  and  1  %  of  the 
community  for  their  cross-hybridizing  relatives.  Cross-hybridizing  genotypes 
produced  distinct  hybridization  patterns  across  each  target  probe  set,  allowing 
related  strains  to  be  distinguished. 

Having  developed  and  validated  the  prototype,  an  expanded  version  was 
developed  and  constructed  targeting  268  genotypes,  representing  all  major 
marine  microbial  clades,  and  spanning  the  relevant  intra-clade  variability  among 
abundant  groups  where  possible  (Chapter  3).  This  array  was  used  to  profile  57 
samples  collected  over  four  years  in  Monterey  Bay  (MB),  at  three 
oceanographically  distinct  depths  (the  photic  zone,  just  below  the  mixed  layer, 
and  the  subphotic  zone,  Om,  30m  and  200m  respectively),  as  well  as  a  single 
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depth  profile  (25m,  75m,  125m  and  500m)  from  Hawaii.  Three  MB  Om  samples 
were  additionally  pyrosequenced  to  allow  for  cross-comparison  with  array  data. 
This  cross-comparison  showed  a  strong  linear  correlation  between  a  targeted 
genotype’s  array  signal  and  its  metagenomic  abundance  (R^=0. 85-0.91  for  the 
three  samples).  In  addition,  the  Monterey  Bay  targets  produced  in  silico 
hybridization  to  1 .9%  -  2.5%  of  the  pyrosequence  reads  in  the  three  samples. 

Based  on  similarity  of  array  profiles,  MB  samples  clustered  into  shallow  (Om 
and  30m)  and  deep  (200m)  samples.  -35%  of  targeted  genotypes  were  present 
in  one  or  more  MB  sample,  and  the  majority  showed  depth-specific  distributions. 
Targeted  clades  expected  to  be  present  in  this  environment  (e.g.  SAR1 1 , 

SAR86,  etc)  showed  signal  on  the  array,  with  depth  distributions  generally 
consistent  with  previous  observations  of  each  clade  at  this  site  and  elsewhere,  A 
relatively  small  number  of  taxa  accounted  for  much  of  the  variability  between 
profiles  from  different  depths.  Reflecting  the  highly  variable  and  dynamic  mixed 
layer  depth  at  this  site,  30m  samples  did  not  cluster  separately  from  Om  samples. 
In  addition,  no  correlation  to  oceanographic  season  was  observed  among 
profiles,  although  bloom  and  post-bloom  signatures  were  evident  as  increased 
profile  intensities  in  Om  samples  and  in  decoupled  nutrient  correlations  to  sample 
variability  in  200m  samples.  Together  with  oceanographic  data,  array-based 
population  insights  are  allowing  us  to  work  towards  identifying  ecotypes  in  poorly- 
understood  marine  clades. 

The  expanded  genome  proxy  array  has  begun  to  realize  its  potential  for 
high-throughput  profiling  of  marine  samples,  tracking  a  larger  number  of  targets 
simultaneously  at  higher  phylogenetic  resolution  than  has  been  previously 
possible  with  other  methods. 

In  addition  to  the  ongoing  analyses  of  Chapter  3  data  mentioned  in  its 
concluding  paragraphs,  there  are  several  other  current  or  future  projects 
involving  the  genome  proxy  array  that  may  be  worth  pursuing.  These  fall  into  two 
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categories;  protocol  improvements  and  application  to  particular  research 
questions  in  marine  microbial  community  ecology. 

There  are  a  number  of  protocol  improvements  that  might  ideally  be  made 
with  this  array  platform.  In  particular,  the  random-primed  A/B/C  PCR-based 
amplification  and  labeling  method  is  laborious,  requiring  several  very  long  days 
for  each  set  of  reactions  and  several  physically-challenging  steps.  We  have 
discussed  testing  MDA  versus  A/B/C  PCR,  and  the  use  of  MDA  with  microarrays 
has  been  described  in  the  literature  (e.g.  Wu  et  al  2006  for  MDA).  The  relative 
biases  and  skewing  of  MDA  remain  unclear  and  several  studies  indicate  that  they 
are  not  insubstantial.  Therefore,  before  switching  array  protocols  to  MDA-based 
amplification  and  iabeiing,  a  comprehensive  set  of  comparisons  would  be 
required.  Specifically,  a  single  sample  would  be  treated  in  three  different  ways.  1 . 
Direct  pyrosequencing.  2.  Optimized  A/B/C  amplification  followed  by 
pyrosequencing.  3.  Optimized  MDA  amplification  followed  by  pyrosequencing.  If 
it  were  important  to  save  financial  resources,  a  single  454  run  could  be  used  to 
address  this.  Although  the  depth  of  coverage  in  these  diverse  communities  is 
often  not  high  even  for  dominant  taxa,  use  of  a  bloom  sample  might  allow  a 
single  454  run  to  be  split  and  used  for  multiple  samples. 

Another  protocol  improvement  would  focus  on  better  consistency  of  slide 
PLL  coating.  Despite  improvements,  some  PLL-coated  slide  batches  exhibit 
considerable  surface  irregularities  and  peeling,  which  affect  the  quality  and  yield 
of  data.  Manual  removal  of  poor-quality  spots  is  time-consuming,  and  for 
particularly  bad  arrays  additional  hybridizations  must  be  performed.  A  number  of 
coated  slides  for  array  printing  are  commercially  available  and  I  tried  several 
earlier  in  my  thesis,  but  did  not  have  good  results.  Ideally,  the  homemade 
method  can  be  improved,  because  it  is  not  particularly  time-consuming  or  difficult 
and  it  is  quite  inexpensive  (see  Appendix  1 ,  Reagent  Worksheet,  for  details).  We 
have  addressed  issues  of  water  and  reagent  quality,  washing  configuration, 
ambient  dust  amelioration,  drying  spin  force  and  duration,  and  storage.  However, 
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surface  performance  remains  stubbornly  variable  among  batches,  in  a  difficult-to- 
predict  way  prior  to  array  use. 

Lastly,  a  major  area  for  continued  improvement  is  in  the  data  processing 
pipeline.  For  the  prototype  array  paper,  I  worked  with  Kostas  Konstantinidis,  then 
a  postdoc  in  the  lab,  to  develop  a  series  of  scripts  for  pre-processing  the  array 
data,  which  is  in  the  form  of  gpr  files  from  the  array  scanner.  For  the  Monterey 
Bay  time  series  paper,  I  worked  with  John  Eppley,  currently  part  of  the  lab  as  a 
computational  expert,  to  build  a  next  generation  script  that  consolidated  tasks 
into  a  single  workflow.  During  the  switch  from  prototype  to  expanded  array,  Dr. 
Eppley  and  I  refined  several  of  the  preprocessing  steps,  as  described  in  the 
methods  of  Chapter  3.  However,  there  remains  room  for  further  optimization. 
Issues  worth  particular  continued  consideration  include:  i)  background 
subtraction,  ii)  array-to-array  normalization,  iii)  genome-proxy  array-specific 
filtering  thresholds  to  remove  spurious  cross-hybridization. 

For  the  Monterey  Bay  data  analysis,  I  began  using  an  open-source  software 
package  initially  designed  by  the  Broad  Institute  for  array-based  gene  expression 
analysis.  This  software,  GenePattern  (mentioned  in  the  methods  of  Ch.  3),  is 
module-based  and  designed  for  tailoring  to  novel  needs.  Several  of  its  existing 
modules  are  useful  for  examining  pre-processed  data  from  the  genome  proxy 
array,  and  modules  can  be  pipelined  together,  with  an  archiving  option  for 
documenting  and  preserving  specific  combinations  of  analyses  and  parameters. 
Dr.  Eppley  has  written  the  data  preprocessing  script  to  be  compatible  with  the 
GenePattern  architecture,  to  allow  it  to  easily  be  added  to  GenePattern  as  a 
module.  This  will  facilitate  non-expert  use,  rapid  testing  of  a  range  pre-processing 
parameters,  and  direct  porting  of  output  data  into  other  analysis  modules.  This 
will  greatly  help  further  optimizing  the  data  processing  steps  mentioned  above.  In 
addition,  the  two  R-based  ecological  analysis  tools  used  for  exploring 
correlations  between  array  data  and  environmental  parameters  will  be  relatively 
straightforward  to  convert  into  GenePattern  modules,  since  GenePattern  is  fully 
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compatible  with  the  R  language.  These  changes  will  continue  to  streamline  the 
analysis  pipeline  for  array  data. 

Among  ongoing  and  future  applications  of  the  array,  a  deeper  examination 
of  Hawaiian  samples,  for  comparison  to  MB  samples  and  to  investigate  time 
series  dynamics  at  Station  ALOHA,  is  being  performed  by  a  graduate  student  in 
the  lab,  Laure-Anne  Ventouras.  My  Hawaii  hybridizations  from  not  only  HOT179 
but  also  several  other  Hawaii  profiles,  using  both  the  prototype  and  the  expanded 
array,  produced  generally  good  results  and  the  presence  of  expected  taxa. 

These  open  ocean  samples  produced  markedly  different  profiles  than  the  coastal 
Monterey  Bay  samples,  as  expected.  As  mentioned  in  Chapter  3,  however,  some 
of  the  hybridizations  produced  weak  signal  which,  when  using  the  same  data 
processing  parameters  optimized  for  MB  samples,  resulted  in  very  few  targets 
being  called  “present”.  Whether  this  is  due  to  inaccurate  DMA  quantification,  the 
presence  of  contaminants,  hybridization  irregularities,  or  the  need  to  re-tailor 
processing  parameters  for  the  new  site,  remains  unclear,  and  needs  to  be 
answered  to  maximize  the  utility  of  the  array  for  cross-habitat  comparative 
studies. 

In  addition  to  the  purely  array-based  profiling  of  HOT  samples, 
pyrosequence  data  are  available  for  several  of  these  samples  and  a  comparison 
of  array  and  pyrosequence  data  is  ongoing.  Working  with  another  graduate 
student  in  the  lab,  Yanmei  Shi  (a  co-author  on  the  Ch.  3  manuscript),  we  cross- 
compared  pyrosequence  and  array  data  for  HOT179  75m.  The  correlations 
between  target  abundance  by  pyrosequence  and  by  array  intensity  are  in  the 
same  range  as  seen  for  Monterey  Bay  pyrosequence  cross-validations.  Using 
the  improved  and  rapid  pyrosequence-to-array  pipeline  developed  for  the  MB 
datasets,  this  75m  sample  will  be  re-examined,  together  with  the  25m,  125m.  and 
500m  HOT179  datasets.  In  addition,  it  is  likely  that  deeper  investigations  of  the 
signals  present  in  the  array  profiles  versus  the  sequence  datasets  will  be 
appropriate  for  these  samples,  to  examine  and  validate  e.g.  population  variability, 
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as  is  underway  for  the  MB  datasets.  The  first  stage  of  this  cross-comparison,  the 
correlation  of  target  abundances  produced  by  the  two  methods,  will  likely  be 
included  in  the  manuscript  of  Chapter  3,  since  the  array  profiles  of  these  samples 
are  compared  therein  to  the  MB  samples.  The  second,  more  specific  cross¬ 
comparison  of  populations  may  be  most  usefully  performed  as  part  of  the  Hawaii 
time-series  profiling  paper,  the  work  currently  being  undertaken  by  Ms. 

Ventouras. 

Comparing  the  open-ocean  Hawaii  array  profiles  with  the  coastal  Monterey 
Bay  profiles  poses  the  question  of  the  habitat  specificity  of  the  genome  proxy 
array.  Can  it  only  be  reliably  used  at  locations  from  which  its  targets  are  derived? 
Based  on  existing  data,  the  answer  is:  No,  the  array  can  be  an  effective  cross¬ 
location  platform.  Although  the  majority  of  signal  at  each  location  came  from 
clones  derived  from  that  location,  this  could  be  interpreted  in  two  ways.  Hawaii 
versus  Monterey  samples  may  represent  generically  distinct  open  ocean  versus 
coastal  communities,  and  the  clone-derivation-to-signal  pattern  may  simply 
represent  this  overall  habitat  difference.  Alternatively,  this  pattern  could  be 
indicative  of  site  rather  than  habitat  specificity;  that  is,  another  set  of  coastal 
versus  open  ocean  samples  would  not  show  appreciable  signal,  and/or  would  not 
show  the  same  inversion  of  clone-derivation-to-signal  in  the  two  habitats.  The 
former  explanation  seems  most  likely,  for  several  reasons.  First,  the  prototype 
array  was  hybridized  to  coastal  waters  from  the  East  coast,  a  roughly  similar 
habitat  but  very  distant  location  to  the  origin  of  most  of  the  targeted  clones. 
However,  a  number  of  these  Monterey  Bay-derived  clones  produced  strong 
signal  from  Woods  Hole  communities  (see  e.g.  Figure  4a,  Rich,  Konstantinidis 
and  Delong,  2008,  Chapter  2).  Second,  a  number  of  clones  produce  significant 
array  signal  in  samples  from  locations,  and  even  habitats,  very  different  to  those 
of  their  origin.  For  example,  the  deep-consistent  clade  described  in  Chapter  3, 
whose  taxa  were  consistently  present  and  abundant  in  MB  200m  samples, 
included  two  (of  10)  targets  derived  from  Hawaii,  and  one  from  500m  in  the 
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Antarctic  Polar  Front.  The  same  is  not  true  for  the  shallow-consistent  cluster, 
which  re-emphasizes  a  point  made  in  the  combined  clustering  of  Hawaii  and 
Monterey  Bay  samples  in  Chapter  3,  Figure  4c,  and  observed  in  other  marine 
microbial  studies;  deep  communities  from  very  different  environments  can  be 
significantly  more  similar  than  their  surface  counterparts.  Thus,  overall,  there  is 
inherently  greater  habitat  specificity  involved  in  probing  shallow  versus  deep 
communities.  However,  evidence  suggests  this  is  indeed  habitat  rather  than  site 
specificity,  which  would  enable  the  array  to  be  effectively  applied  to  other  coastal 
and  open  ocean  samples.  Further  samples  from  several  diverse  locations  should 
be  hybridized  to  the  array  to  confirm  this  hypothesis.  Extracted  DMAs  from 
Woods  Hole,  coastal  Chile,  coastal  Oregon,  Bermuda,  and  Antarctica  are  all 
available  in  the  lab  and  so  performing  a  suite  of  hybridizations  would  not  be  a 
major  undertaking. 

In  addition  to  examining  the  genome-proxy  array  data  as  overall  organism 
signals  and  hybridization  patterns  across  probesets,  it  may  also  be  worthwhile  to 
investigate  signals  from  single  probes,  representing  single  genes.  For  example, 
cross-hybridization  to  just  one  or  a  few  probes  in  a  probeset  is  considered 
spurious  signal  and  is  not  further  examined.  However,  strong  signal  to  a 
particular  gene  that  is  not  reflected  in  the  rest  of  the  target  organism  probes 
could  reveal  the  importance  of  particular  genes  in  some  samples,  with  their  high 
conservation  and  presence  in  other  organisms  causing  their  cross-hybridization. 
Thus,  significant  decoupling  of  the  gene  and  organism  signals  could  be  a  useful 
flag  for  processes  of  importance.  In  addition,  because  probes  were  not  chosen  to 
particular  genes  but  rather  selected  based  on  predicted  hybridization  kinetics,  the 
probes  are  not  limited  to  genes  with  already-recognized  ecological  importance, 
and  include  many  hypotheticals.  Although  the  sequence  divergence  of  the 
majority  of  genes  makes  strong  cross-hybridization  unlikely,  among  the  ~5360 
genes  targeted  by  the  array  there  are  likely  to  be  several  interesting  stories  in  the 
data  already  obtained  from  the  Monterey  Bay  samples. 
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A  different  and  interesting  possible  application  of  the  genome  proxy  array  is 
for  hybridization  to  amplified  and  labeled  community  cDNA.  Frias-Lopez  and  Shi 
et  al.  (2008)  have  optimized  a  protocol  for  preferentially  amplifying  community 
mRNA  from  marine  samples.  Working  with  Yanmei  Shi  and  Laure-Anne 
Ventouras,  we  amplified  and  reverse  transcribed  community  RNA  from  a  Hawaii 
sample,  HOT179_75m,  and  Ms.  Ventouras  and  I  performed  side-by-side 
replicated  DMA  and  cDNA  array  hybridizations.  Following  the  logic  of  community 
metatranscriptomics  in  e.g.  Frias-Lopez  and  Shi  et  al.  (2008),  in  order  to  interpret 
transcript  abundance  it  is  necessary  to  also  measure  gene  abundance,  in  order 
to  normalize  expression  to  gene  copy  number.  A  first-pass  analysis  of  the  data 
using  the  existing  array  pipeline  showed,  unsurprisingly,  that  overall  organism- 
based  expression  levels  were  low,  when  RNA  hybridization  data  were  treated 
identically  by  the  script  and  averaged  across  all  probes  for  an  organism. 

However,  high  expression  across  all  or  most  of  a  genome  fragment’s  randomly- 
targeted  genes  would  not  be  expected.  When  genes  were  examined  on  an 
individual  basis,  a  small  number  of  genes,  primarily  involved  in  housekeeping 
and  core  metabolic  functions,  had  high  array-based  expression  signal.  When 
cross-compared  to  the  overall  pyrosequencing-based  metatransaiptomics 
analysis,  these  genes  were  not  among  those  with  the  highest  expression  levels. 
However,  since  the  metatranscriptomics  analysis  was  global  in  scope  and  the 
array  only  targets  a  small  fraction  of  all  community  genes,  an  additional  step  is 
required  of  using  just  the  array  targets  to  recruit  metatranscriptomic  reads,  as 
was  done  for  the  DNA-based  array-versus-pyrosequencing  comparison 
described  in  Chapter  3.  That  pipeline  was  not  available  at  the  time  of  the  RNA 
hybridization  and  the  matter  has  not  been  revisited,  but  is  worth  doing.  Due  to  the 
small  representation  of  total  gene  space  on  the  array,  this  particular  array 
platform  would  not  be  a  good  primary  tool  for  profiling  community  expression,  but 
might  provide  useful  secondary  expression  information. 

An  alternative  form  of  microarray,  related  to  the  genome  proxy  array,  is  an 
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environmental  clone  library  array.  This  library  array  would  be  akin  to  the 
community  genome  arrays  described  in  chapters  1  and  2,  but  instead  of  each 
probe  spot  being  a  whole  genome  of  a  cultivated  or  isolated  microbe,  it  would  be 
a  cloned  genome  fragment  from  the  environment.  This  bypasses  the  need  for 
sequencing  clones  before  they  are  targeted.  Indeed,  this  method  has  recently 
been  used  as  a  library-screening  tool  (e.g.  Soule  et  al.,  2006),  with  thousands  of 
clone  library  members  arrayed  and  then  queried  with  e.g.  labeled  PCR  amplicons 
of  genes  of  interest.  Rather  than  using  such  arrays  primarily  as  a  library 
exploration  tool,  when  the  Delong  Lab  already  has  a  well-refined  library 
macroarray  production  and  screening  pipeline  worked  out,  these  arrays  would 
instead  be  used  to  query  amplified  and  labeled  environmental  DNA,  just  as  the 
genome  proxy  array  is. 

This  approach  has  great  appeal  because  it  removes  a  portion  of  the 
deterministic  nature  of  array  design.  For  example,  in  order  to  be  represented  on 
the  genome  proxy  array,  all  targeted  genotypes  first  had  to  be  sequenced,  and 
both  genomes  and  genome  fragments  were  generally  chosen  for  sequencing 
because  they  were  already  known  or  suspected  to  be  involved  in  a  process  of 
interest.  Instead,  if  large  numbers  of  clones  from  genomic  libraries  were  arrayed 
in  a  random  fashion,  then  hybridization  results  would  reveal  which  hitherto 
unrecognized  clones  might  be  highly  abundant,  or  vary  along  ecological 
gradients.  Those  clones  would  then  be  targeted  for  sequencing  and  further 
characterization.  It  is  even  possible  that  for  particularly  abundant  taxa,  or  highly 
uneven  samples,  multiple  clones  deriving  from  the  same  taxa  or  clade  could  be 
binned  based  on  similar  hybridization  intensities,  akin  to  the  coverage-  and  GC- 
content-  based  binning  of  metagenomic  data  in  Tyson  etai,  2004. 

Lastly,  further  Monterey  Bay  investigations  should  be  pursued  using  the 
genome  proxy  array.  There  is  a  wealth  of  samples  available  that  were  not 
included  in  this  first  study,  for  a  variety  of  reasons  (e.g.  they  were  collected  using 
different  filtration  methods,  or  from  different  volumes  of  seawater,  etc.).  These 
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samples  span  several  more  years  and  cover  two  additional  stations,  in  the  inner 
and  outer  Bay,  respectively,  and  were  sequenced  with  the  same  frequency  as 
Station  M1 .  In  addition,  there  are  several  sets  of  samples  from  the  Monterey  Bay 
Aquarium  Research  Institute  (MBARI)  Biological  Oceanography  Group  (BOG) 
cruises  along  the  old  CalCOFI  (California  Cooperative  Oceanic  Fisheries 
Investigations)  Line  67  transect  which  runs  from  Monterey  Bay  to  300km 
offshore.  This  transect  crosses  four  distinct  oceanographic  zones,  from  the 
seasonal  upwelling  band  along  the  California  coast  (up  to~20-50km  offshore)  all 
the  way  out  to  the  California  Current  (170-300km  offshore)  (Pennington  et  al. 
2007).  The  great  majority  of  all  these  MB  samples  have  not  been  extracted. 

In  the  same  vein,  a  fascinating  question  and  suggestion  by  my  co-advisor 
involved  the  possibility  of  hybridizing  much  older  archived  DMAs,  from  samples 
stored  in  ethanol  or  formaldehyde,  from  the  last  century.  Again,  Monterey  Bay 
sits  at  a  nexus  of  oceanographic  and  fisheries  exploration,  with  a  rich  history  of 
sample  collection  and  observation.  It  exhibits  a  strong  El  Nino  signature,  and  has 
shown  warming  over  the  last  hundred  years.  Since  the  array  hybridizes  to 
fragmented  DNA,  which  is  amplified  randomly  prior  to  and  during  the  labeling 
process,  it  is  reasonable  to  think  that  fairly  degraded  DNA  (e.g.  <5000bp,  ideally 
>500bp)  could  still  hybridize  reliably  and  meaningfully  to  the  array.  This  would 
have  to  be  tested  with  sheared  DNA.  In  addition,  it  would  be  important  to  identify 
preservation  biases  in  the  integrity  of  DNA  from  different  clades;  e.g.,  one  can 
imagine  a  GC-rich  clade  perhaps  faring  better  over  time.  Finally  and  most 
importantly,  this  assumes  the  existence  of  such  preserved  samples,  in  enough 
numbers  and  with  sufficient  associated  metadata  to  be  useful  for  mapping 
microbial  community  change  over  the  time  scales  involved.  Further,  it  assumes 
that  the  guardians  of  such  samples  would  be  willing  to  give  some  portion  to  this 
endeavor. 

In  conclusion,  there  are  a  number  of  next-step  experiments  that  could  be 
undertaken  to  improve  the  existing  protocols  and  to  further  define  the  working 
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scope  of  the  array.  There  are  also  a  number  of  research  questions  that  the  array 
could  be  a  useful  tool  for  answering,  given  its  low  cost  and  relatively  high 
throughput,  with  good  information  yield  per  units  of  time,  money,  and  DNA,  and 
its  unique  attributes.  It  can  contribute  to  our  exploration  of  marine  microbial 
communities  at  finer  scales  of  both  sampling  and  phylogenetic  resolution  than 
are  practical  using  other  methods.  In  addition,  the  array  is  uniquely  useful  for 
identifying  populations  of  related  genotypes,  which  is  currently  difficult  to  do 
using  other  methods,  even  metagenomics,  without  a  pr/or/ expectations  guiding 
analyses.  Therefore,  even  as  the  cost  of  sequencing  decreases,  the  array  can  be 
a  highly  complementary  tool  in  a  microbial  ecologist’s  repertoire,  by  revealing 
features  of  the  microbial  community  not  readily  observed  with  other  tools. 
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Appendix  1 


Protocols  &  Source  Sheets  Developed  during  this  thesis  for  the  Genome  Proxy 
Array. 
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Preparing  poly-lysine  slides  for  printing  microarrays: 

Coal:  To  coat  glass  microscope  slides  with  poly-lysine  (“PLL”),  so  they  are  “sticky”.  Then 
when  the  arraying  robot  prints  DNA  spots  onto  them  (making  a  microarray)  the  DNA 
sticks. 


Materials: 

Cold  Seal  Micro  Slides,  Cat.  No.  301  0,  3" 
X  1",  1mm  thick,  Fisher  #  12-518-1  OOA 
NaOH  pellets,  e.g.  Sigma  #  S8045 
95%  EtOH 
lots  of  Milli-Q  H20 
IX  or  1  OX  PBS 

Poly-L-lysine  solution,  0.1%  (w/v).  Sigma# 
P8920 

slide  boxes,  Fisherbrand,  foam-lined  not  cork- 
lined,  Fisher  #03-448-4 


1  secondary  containment  tray 
3L  glass  beakers  (2-i-) 

1  1 L  plastic  beaker 

plastic  boxes  for  PLL-coating  of  slides 
(pipette-tip  boxes  work  fine) 
metal  slide  racks 
plastic  washing  container 
plastic  wrap  or  tinfoil 
rubber  band 


Notes: 

*rinse  all  containers  with  RO  before  using,  to  remove  dust,  etc..  Keep  these  containers 
separate  (at  Cin’s  bench). 

*  throughout  this  process,  make  sure  the  slides  in  the  racks  stay  well  separated,  by 
running  a  gloved  finger  across  the  top  edge  of  the  slides,  etc  -  during  rinsing,  etc.  -  at 
all  steps 

*  to  keep  dust  from  sticking  to  the  slides,  once  the  protocol  is  begun  try  to  keep  the 
slides  submerged  in  solution  at  all  times,  &/or  covered. 

*  do  NOT  use  powdered  gloves  during  this  protocol. 

*  slides  MUST  be  stored  at  least  two  weeks  before  spotting  DNA  (DeRisi  lab  says  2 
weeks,  Schoolnik  lab  says  3  weeks,  Somero  lab  says  1  week).  Don’t  use  slides  that  are  >  4  mos. 
old  for  printing,  sometimes  the  poly-lysine  degrades  (DeRisi  says  4  mos.,  Schoolnik  labs  says 
3  mos.). 


1.  Wash  slides:  Although  the  slides  come  “clean”  in  a  box,  there  is  still  a  fair  bit  of  dust 
and  dirt  on  them  and  they  need  to  be  extremely  clean  before  they  are  coated,  because 
any  dirt  will  cause  irregularities  in  the  coating  which  will  be  weak  spots,  and  may  peel 
off. 


make  up  wash  solution: 

need  about  600mls  of  solution  to  cover  a  metal  slide  rack  in  a  3L  beaker 
in  a  glass  beaker  (2L  or  bigger),  mix: 


For  60  slides:  1  2Q0mls 

1  20  g  NaOH  pellets 
720  mis  95%  EtOH 
480  mis  Milli-Q  H20 


For  90  slides:  1  SOOmIs 

1  80  g  NaOH  pellets 
1080  mis  95%  EtOH 
720  mis  Milli-Q  H20 


For  1  20  slides:  2400mls 

240  g  NaOH  pellets 
1440  mis  95%  EtOh 
960  mis  Milli-Q  H20 


(final  cone,  for  this  solution  is  5796  EtOH,  1  096  w/v  NaOH) 
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mix  with  stir-bar  on  stir-plate  (no  heat)  until  fully  dissolved  (takes  ~1  5min) 

Thoroughly  rinse  1  glass  3L  beaker  for  each  30  slides. 

Place  slides  into  metal  racks  (our  current  racks  take  30  slides  each),  use  air 
canister  to  spray  off  slides. 

Put  racks  into  beakers,  and  beakers  into  one  large  plastic  tub  (for  secondary 
containment  on  shaking  table  -  this  solution  is  highly  basic  and  we  don’t  want  it  to 
spill).  Put  secondary  containment  tray  on  shaker,  into  hood. 

pour  600mls  wash  solution  into  each  3L  beaker.  Cover  the  beakers,  with  plastic  wrap 
or  foil,  to  prevent  dust  from  getting  in. 

shake  aentiv  on  table  to  wash,  about  2hrs 

2.  Rinse  slides:  The  wash  solution  must  be  fully  rinsed  from  the  slides  or  it  will  interfere 
with  the  coating  process. 

wash  slides  5x  vigorously  w\th  clean  Milli-Q  water  -  e.g.,  put  two  racks  at  a  time  in 
long  narrow  plastic  tubs,  place  a  rubber  band  around  the  slides  -  as  close  to  the  two 
ends  as  possible!!!  -  to  hold  them  in  the  rack  securely,  run  Milli-Q  water  over  racks 
and  then  swoosh  racks  up  and  down  and  back  and  forth  in  water  vigorously, 
repeatedly,  for  maybe  30  seconds.  Dump  water  and  repeat  process  4  more  times. 

let  the  slides  sit  in  clean  water  while  you  prepare  the  polv-Ivsine  solution 

3.  Coat  slides: 

*only  use  PLASTIC  with  poly-lysine!* 

So,  to  coat  the  slides  use  the  empty  plastic  pipette-tip  boxes  labeled  “for  poly-lysine". 
To  make  up  poly-lysine  solution,  use  a  plastic  beaker  or  dish. 

For  coating  2  racks  of  slides  at  once: 

For  750mls  polv-Ivsine  solution: 

(This  solution's  final  concentrations  are  ~0.01 69%  w/v  PLL,  and  0.0984X  PBS.) 

550mls  Milli-Q  H20 

73.8mls  IX  PBS  (kept  in  fridge  once  opened,  post-autoclaving) 
126.72mls  poly-lysine  solution,  0.1%  w/v  in  H20 
mix  ingredients  in  order  listed,  in  plastic,  with  stir-bar  on  stir-plate 

dump  excess  water  from  slides 

place  each  of  two  slide  racks  into  its  own  poly-lysine  pipette-tip  box 
immediately  pour  375mls  polv-Ivsine  solution  over  each  rack 
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close  each  box 
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put  the  two  boxes  into  a  plastic  tub  for  secondary  containment 

let  slides  shake  on  shaker  table  in  polv-lvsine  solution  for  30  minutes 

(Note;  each  box  of  poly-lysine  solution  can  be  re-used  the  same  day,  to  coat  one  more  rack  of  slides. 
Aiso,  if  a  lot  of  slides  wiii  be  prepared  on  several  consecutive  days,  the  soiution  can  be  fiitered  after  the 
first  day  and  stored  in  piastic  in  the  fridge.  When  ready  to  use  again,  add  an  additionai  3-5  mis  of  poly¬ 
lysine.  This  can  be  repeated  a  maximum  of  six  times.) 

4.  Rinse  slides;  All  the  excess  poly-lysine  solution  should  be  removed,  so  that  the 
coasting  is  as  even  as  possible. 

lift  the  metal  slide  racks  out  of  the  poly-lysine  solution,  and  place  the  two  racks  into 
the  long  narrow  plastic  tub  for  washing.  Wash  slides  5x  vigorously  with  clean  Milli-Q 
water  -  following  the  same  protocol  as  first  rinse  step. 

(Meanwhile,  if  needed,  start  next  racks  of  slides  coating.) 

Keep  the  rinsed  slides  in  water  until  they  are  spun  -  they  can  wait  a  few  minutes 
if  someone  else  is  using  the  centrifuge,  for  example. 

Drain  the  excess  water  from  racks,  and  put  them  into  the  centrifuge,  on  top  of 
several  folded  up  large  kimwipes. 

Position  the  racks  with  the  slides  in  a  consistent  orientation  so  you  know  which 
way  is  “facing”  the  direction  of  motion.  (Imagine  riding  on  a  very  fast  merry-go-round 
coating  in  maple  syrup.  As  you  move,  the  syrup  would  dry  off  the  front  side  of  you 
first,  and  also  flow  around  your  edges  a  little  to  build  up  along  the  sides  of  your  back. 
We  see  this  with  the  slides  -  on  their  back  faces,  the  edges  show  signs  of  PLL  wrapping 
around  during  the  spin...  No  problem  on  the  backside,  but  not  so  great  for  the  side  we 
want  to  print  on.) 

Make  sure  the  racks  are  balanced  in  weight  (same  number  of  slides)  and 
position.  Check  the  separation  of  the  slides  in  the  racks  (finger  run  across  top  edge). 

spin  ~1  50xq  (in  Ed’s  plate-spinner  centrifuge,  a  Sorvall  Legend  RT,  with  the 
plate-spinner  rotor  in  it)  for  5-10  min.,  at  room  temperature,  until  dry.  Before  putting  the 
slides  in  the  centrifuge,  spray  out  the  rotor,  plate  holders,  etc.,  with  an  air  canister,  to  prevent  dust. 

remove  the  slides  from  the  racks  and  place  into  a  slide  storage  box,  (having 
sprayed  the  box  dust-free  with  an  air-can),  all  slides  in  the  same  orientation  (e.g.  all 
with  the  slide  faces  that  were  at  the  “front”  -  in  the  direction  of  spinning  motion  - 
pointing  the  same  direction  in  the  box.  Label  the  outside  of  the  box  with  tape  saying 
“PLL-coated  slides”  and  the  date,  and  your  name,  and  the  directionality  of  the  slides  in 
the  box  (based  on  direction  of  spin,  and  see  below). 
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Place  storage  box  into  dessicator  cabinet. 

5.  To  QC  the  slides:  It’s  a  good  idea  to  see  how  the  slides  look,  to  make  sure  nothing 
obvious  has  gone  wrong.  This  can  be  done  the  same  day  or  in  the  next  several  days. 

-  breathe  on  them,  look  for  surface  irregularities. 

-  check  their  background  by  scanning  on  the  array  scanner 

-  edge  effects  not  that  uncommon,  shouldn’t  interfere  with  where  array  goes 

-  slide  sidedness  where  one  face  of  slide  is  more  speckly  than  the  other  also  not 
uncommon.  If  this  occurs,  orient  slides  methodically  in  the  case  so  that  the  better  side 
always  faces  the  same  direction,  and  note  this  on  the  outside  of  the  storage  box  -  this 
will  be  the  face  used  for  spotting  the  DNA  onto. 

-  it’s  probably  good  to  save  a  few  of  the  slide  scans,  and  print  out  on  a  single 
page,  to  keep  a  record  of  that  batch’s  general  properties  post-coating.  Just  doing 
Preview  Scans  is  fine.  Save  the  file,  and  also  export  it  using  both  our  standard 
brightness/contrast  and  the  auto-brightness/contrast. 

Our  standard  settings  for  scanning  PLL  slides: 

635nm  laser  (red):  1 00%,  PMT  Gain:  600 
532nm  laser  (green):  100%,  PMT  Gain:  600 
brightness  at  95,  contrast  at  93 

but,  also  look  at  and  save  a  few  images  using  the  auto-brightness/contrast  button 
our  file  naming  convention: 

Year_Month_Day_description_slidenumber_front_or_back_autosettings 
e.g.  2005_01  _1  9_newPLLslides_l  f  or  2004_1  2_20_newPLLslides_3b_auto 
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Protocol  Summary; 


1 .  Wash  slides 

make  up  wash  solution: 

in  a  glass  beaker,  mix: 

For  60  slides:  1  200mls  For  90  slides:  1  SOOmIs 


1  20  g  NaOH  pellets 
720  mis  95%  EtOH 
480  mis  Milli-Q  H20 


1  80  g  NaOH  pellets 
1080  mis  95%  EtOH 
720  mis  Milli-Q  H20 


For  1  20  slides:  2400mls 

240  g  NaOH  pellets 
1440  mis  95%  EtOh 
960  mis  Milli-Q  H2Q 


pour  SOOmIs  wash  solution  into  each  *clean*  3L  beaker,  containing  rack  of  slides. 

Cover  the  beakers. 

shake  gently  on  table  to  wash,  about  2hrs 

2.  Rinse  slides:  wash  slides  5x  v/gorotvs/v  with  clean  Milli-Q  water 
let  the  slides  sit  in  clean  water  while  you  prepare  the  polv-lysine  solution 


3.  C(xit  slides: 

*only  use  PLASTIC  with  poly-lysine!* 
for  750mls  polv-Ivsine  solution: 

550mls  Milli-Q  H20 

73.8mls  IX  PBS  (kept  in  fridge  once  opened,  post-autoclaving) 

1  26.72mls  poly-lysine  solution,  0.1  %  w/v  in  H20 
mix  ingredients  in  order  listed,  in  plastic,  with  stir-bar  on  stir-plate 

dump  excess  water  from  slides 

place  each  of  two  slide  racks  into  its  own  poly-lysine  pipette-tip  box 
immediately  pour  375mls  polv-Ivsine  solutio nj? ye r_^c_h_racj< 
close  boxes  &  put  into  a  plastic  tub  for  secondary  containment 
let  slides  shake  on  shaker  table  in  polv-Ivsine  solution  for  30  minutes 

any  additional  racks  of  slides  can  sit  in  Mill-Q  water  while  waiting  their  turn  for  the 
poly-lysine 

4.  Rinse  slides: 

Wash  slides  5x  vigorously  with  clean  Milli-Q  water 

(Meanwhile,  if  needed,  start  next  racks  of  slides  coating.) 

Drain  the  excess  water  from  racks,  and  put  them  into  the  centrifuge,  on  top  of 
several  folded  up  large  kimwipes.  Balance  them,  and  note  direction  of  motion  of  slides 
(which  faces  point  “forward"). 

spin  1  50q  for  5  min.,  or  until  dry. 

remove  the  slides  from  the  racks  and  place  into  a  slide  storage  box,  all  in  the 
same  orientation.  Label  the  outside  of  the  box.  Place  storage  box  into  dessicator. 
cabinet. 

<:  <"W’  tk... 
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Form  for  Preparing  Poly-Lysine  coated  slides,  for  microarrays; 

Date: _  Preparer: _ 

Number  of  slides  being  prepped: 

slide  lot  #: _  NaOH  lot  #: _  EtOH  lot#: _ 


NaOH  slide  wash: 

time  in  washing: time  out: total  time: 


PLL  slide  coating: 

L'  batch,  time  in  coatina: 

time  out: 

total  time; 

2"“  batch,  time  in  coatina: 

time  out: 

total  time: 

Notes  about  any  changes,  observations,  accidents,  etc: 


Q/C  of  slides: 

1 .  Visual  inspection  when  breathed  upon: 


2.  General  appearance  using  scanner 


3.  A  few  representative  scans,  using  both  standard  settings  (brightness  95,  contrast 
93)  and  auto-settings: 
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Generating  oligonucleotide  probes  for  the  marine  microbial 
microarray: 

all  Instructions  are  for  a  Mac  set-up 


Step  1 :  Generate  a  fasta  file  with  all  genes  from  each  genome/qenome 

fragment: 

There  are  a  variety  of  ways  to  do  this.  I  now  use  a  perl  script.  However,  an  easy 
way  to  do  one  or  a  few  sequences  is  via  the  program  Artemis. 

-  launch  Artemis 

within  a  Terminal  window,  cd  to  the  Directory  containing  Artemis  and  then  type  “art” 
and  return  -  if  this  doesn’t  work,  type  "esh",  return,  then  try  again.  Artemis  should 
launch.  If  you  search  for  Artemis  in  your  computer,  you  may  find  that  you  have  an 
icon-launchable  version  of  it. 

1 .  Options  ->  click  off  eukaryotic  mode 

Open  ->  File  ->  file  of  choice,  gb  file  (yes,  ignore  the  error) 

2.  Select  ->  CDS  features 

Write  ->  bases  of  selection  ->  fasta  format 

use  a  “.fna"  file  extension,  with  appropriate  prefix 
Naming  conventions:  Location  with  some  indication  of  clone  type  &  depth  If  possible,  and 
coordinates,  eg.  HF10_04C06  is  Hawaii  fosmid  10m,  library  plate  4  coordinates  GOG.  For  files, 
the  convention  is  that  "  fna”  is  used  for  fasta  nucleic  acids,  and  ".faa"  is  used  for  fasta  amino 
acids 


3.  Select  ->  All 

Write  ->  bases  of  selection  ->  fasta  format 

4.  Click  in  the  lower  gene  list  window  and  hit 

<control>  and  mouse  click,  then  ->  Save  List  as 
This  list  is  useful  for  annotating  your  oligos  and  for  quickly  checking  the  gene  content  of  a 
fosmId/BAC  later  without  having  to  launch  Artemis.  Some  versions  of  Artemis  won’t  let  you  do 
this;  if  not,  no  worries. 

•  then  open  the  “.fna”  file  in  BBEdit  (or  a  similar  text  editing  program,  one  which  won’t 
add  a  bunch  of  stuff  like  Word  does). 

clean  up  file  names  as  needed  so  that  they  lack  spaces  or  funny 
characters. 

This  will  be  a  different  process  from  file-to-fi!e  since  for  many  In-house  sequences  we’re  still 
working  with  unpublished  versions  that  may  not  be  perfectly  named,  etc.  Generally,  I  prefer  to 
keep  each  gene  name  as  the  Cloneldentifier_Geneldentlfier  -  often  times  this  will  be  the 
sequence  location  of  the  gene  because  the  gene  names  haven’t  yet  been  added,  or  it  may  just 
be  CDS.OOl ,  _002,  etc.,  depending  on  what  Information  the  Artemis-parsed  CDS  list  contains. 
An  Ideal  naming  would  be,  eg.,  AntFos04D03_0to633,  meaning  AntFos  library  clone  04D03,  the 
CDS  from  0  tO  633. 

-  copy  the  file  into  the  ArrayOligoSelector  folder 

Step  2:  Use  Array  Oliqo  Selector  to  generate  the  potential  probes: 
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ArrayOligoSelector  is  available,  along  with  all  documentation,  at 
http  //arrayoligose_Lsourceforge  net/ 

If  working  on  a  Mac  you  will  need  to  download  the  version  of  formatdb  and  blastall  from  the 
NCBI  website  whose  date  corresponds  to  the  same  release  date  (or  as  close  as  possible)  of  the 
AOS  version  you’re  using,  because  the  bundles  AOS  download  comes  with  a  Linux-compatible 
version  of  formatdb  and  blastall.  If  you’re  doing  more  complicated  things  with  AOS  than  what  is 
described  below,  there  will  also  be  other  things  you  would  need  to  download  separately  to 
allow  complete  functionality  of  the  program,  but  for  the  scripts  we  run,  this  is  sufficient.  I  have 
compared  the  results  of  AOS  set  up  this  way  on  a  Mac  Powerbook  G4  to  those  from  AOS  set  up 
on  a  Linux  machine  and  they  are  identical. 


-  in  the  Terminal  window,  change  into  the  ArrayOligoSelector  directory 

%  cd  [drag  and  drop  ArrayOligoSelector  folder  for  pathname] 

-  script  1  generates  a  list  of  all  possible  oligos  from  the  input  sequences,  in  a 
sliding  window  manner.  The  output  file  is  “output  0",  you  can  view  it  as  text. 

%  pick70_scriptl 

if  this  doesn’t  work,  try  typing  ./pick70_scriptl 

if  you  just  type  this  you’ll  get  a  USAGE  error  telling  you  exactly  what  parameters  you  need  to 
input: 

*inputseq:  gene/NUCLEOTIDE  sequences  submit  for  design  in  PASTA  format 
*genome:  genome  GENE/NUCLEOTIDE  sequences  in  PASTA  format 
*oligo_size:  in  basepair 

*MaskByLowercase:  You  can  exclude  sub  sequences  from  the  compuation  using  lower  case. 
Those  sub-regions  will  be  flagged  in  the  outputs.  To  use  lowercase  for  this  purpose,  type 
"yes";  otherwise,  type  "no". 

In  the  case  of  this  array,  we’re  choosing  70-mers,  and  are  not  doing  any  masking  of  sequence. 

So,  what  we’d  really  like  to  do  is: 

%  pick70_scriptl  <input>.fna  <input>.fna  70  no 
For  historical  reasons,  we  use  the  same  CDS  output  fna  file  as  both  CDS  file  and 
genome  file,  against  which  ArrayOligoSelector  checks  for  uniqueness. 

We  discussed  a  dizzying  array  of  possibilities  for  what  to  use  as  the  genome  file,  concatenating 
all  the  fosmid  and  BAC  sequences,  using  all  the  prokaryotic  sequences  in  the  nr  database  - 
these  days  one  could  imagine  using  all  available  environmental  sequences  as  a  "genome"... 

BUT,  for  our  purposes,  we  did  the  simplest  variant  possible  -  using  the  CDs  file  as  CDS  and 
genome.  For  different  organisms  there  is  different  coverage  of  the  nearby  related  "sequence 
space",  and  this  coverage  changes  all  the  time.  One  could  try  to  make  an  array  with  much  more 
specific  probes,  even  to  the  point  of  doing  alignments,  etc,  as  other  groups  have  done  for  other 
arrays,  but  that’s  not  the  purpose  of  this  array.  The  goal  was  to  see  whether  a  "blind"  design 
approach  would  allow  discrimination  among  related  genotypes,  and  with  the  prototype  array  I 
demonstrated  that  it  did.  If  one  were  designing  a  different  array  for  different  purposes,  or  a 
different  system,  one  might  want  to  use  a  different  design  strategy. 

Script  1  will  show  a  warning  error  because  it  can  tell  we’re  not  running  Linux 
and  they  want  to  make  sure  we’ve  got  the  correct  versions  of  blastall  and  such 
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installed,  and  python  -  even  if  everything  is  good,  it  still  gives  you  this  warning 

-  so  type  yes'  to  proceed  when  queried. 

-  script  2  chooses  among  the  many  possible  oligos  for  each  gene  to  give  you 
the  ones  closest  to  your  desired  parameters. 

%  pick70_script2 

again,  if  this  doesn’t  work,  try  typing  ./pick70_script2 

again,  if  you  just  type  this  you’ll  get  a  USAGE  error  telling  you  exactly  what  parameters  you 
need  to  input: 

*CC:  GC  percentage  (eg:  35.5,  positive  float  or  integer  number) 

*Oligo_len:  length  of  Oligo  in  bp(positive  integer) 

*Number_Oligo  how  many  oligos  do  you  want  to  design  (positive  integer) 

^OPTIONAL  binding  energy  cutoff  0  is  the  default 

*OPTIONAL  masking  parameters:  if  used,  all  the  optional  masking  parameters  are  required 

*Mask_Length  maximum  length  of  subsequence  allowed  containing  the 
Mask.Symbols  eg  20 

*Mask_Symbol  (ATGCN):  masking  bases  eg  AT  or  N 

*Mask_Tolerance  (0  1 )  ‘  percent  of  other  bases  allowed  eg:0. 1 

So,  we’d  like  to  do  40%  GC  content  (which  was  the  average  GC  content  of  the  few  tens  of  fully- 
sequenced  clones  present  in  the  lab  database  at  the  time  I  started  this),  and  70-mers.  and  1 
probe  per  gene,  with  no  binding  energy  cutoff  and  no  masking: 

%  pick70_script2  40  70  1 

-  copy  the  oligodup  and  oligofasta  output  files  from  the  ArrayOligoSelector 
folder  into  a  new  location  (remember,  ArrayOligoSelector  has  to  rewrite  those 
intermediate  files  each  time,  so  you  have  to  save  them  before  you  can  run  it 
again),  and  rename  the  files  based  on  the  clone/organism  name. 

Step  3:  Choose  which  output  oligos  to  use  as  probes: 

Again,  there  are  different  ways  to  process  the  AOS  outfiles...  I  now  use  a 
perlscript  to  do  this,  which  will  get  posted  on  the  website  too,  but  this  is  a 
simple,  alternate  way  to  do  the  same  thing  manually. 

-  open  either  output  files  in  BBEdit,  select  all,  and  <apple>  <F>  to  find  and 
replace  -  click  on  the  lower  left  box  for  a  Multifile  search  to  include  the  other 
file,  and  use  grep: 

find:  \r 

replace;  (just  a  blank  space) 
then 
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find:  > 
replace:  \r> 
save  files 

-  open  both  files  in  Excel  with  a  space  as  column  delimiter, 
merge  the  files  into  one 

sort  by  %GC  (column  D  or  E,  depending) 

if  there  are  <20  oligos  with  4096CC,  then  take  those  just  higher  and  lower 
until  you  have  20.  Highlight  these  20  oligos  -  these  are  your  probes. 

if  there  are  >20  oligos  with  40%GC,  then  sort  among  those  oligos  by  AG 
of  hybridization  (column  G  usually),  and  take  the  20  oligos  with  the  lowest 
(=most  negative)  AG  values,  within  those  that  have  40%GC.  Highlight  these  20, 
these  are  your  probes. 

AC  has  been  shown  to  correlate  Inversely  with  hybridization  signal  for  microarray  probes,  which 
makes  good  sense  -  so  if  you’ve  got  a  surfeit  of  potential  probes  with  the  “right”  %CC,  AC 
makes  a  good  criterion  for  selecting  among  them! 

Copy  and  paste  your  chosen  oligos  into  your  master  oligo  file,  and  proceed  as 
you  see  fit. 

An  important  thing  to  note  here  is  that  “blind”  probe  design  means  that  the  process  outlined 
above  does  riof  targeting  particular  genes  of  interest. 
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Resuspending  oligo  plates  for  arraying 

Coal:  I  use  70-mer  oligos  for  printing  our  microarrays.  I  get  them  from  a  company  (lllumina) 
and  have  them  make  aliquot  plates  for  us  of  400  pmoles  of  each  oligo  in  each  well.  I  have 
lllumina  ship  the  plates  dried  down.  (Side  note:  I  also  have  them  ship  the  remainders,  since 
their  aliquotting  robots  apparently  have  some  lame  limitations  and  so  can’t  dispense  the  last 
few  aliquots  for  us  -  but  we  can  do  it  ourselves  from  the  remainder  plates...)  So,  the  dried  down 
aliquot  plates  need  to  be  resuspended  before  they  can  be  used  for  printing. 

1 .  Volumes: 

First,  determine  what  the  resuspension  volume  will  be.  I  typically  resuspend  a  new  oligo 
aliquot-plate  at  lOul  per  well,  which  gives  40pmol/ul  of  each  oligo. 

After  a  plate  is  used  for  printing,  I  dry  it  down  in  a  speedvac  vacuum  centrifuge  with  a  plate 
holding  rotor.  The  oligos  store  better  dried-down,  and  then  I  don’t  need  to  worry  about  evap 
over  time,  etc. 

Then  in  subsequent  resuspensions,  I  add  the  volume  that  should  cause  the  remaining  oligo  to 
be  at  -40pmol/ul  again.  I  do  this  calculation  by  assuming  that  for  each  inking,  the  wells  lose 
0.5ul  of  fluid  (the  pin  takes  -0.3ul,  but  we  assume  O.Sul  total  loss  to  account  for  evaporation  - 
based  on  personal  communication  with  Kevin  Visconti,  Schoolnik  Lab,  Stanford.  However,  it’s 
always  good  to  check  a  few  wells  at  random  after  a  printrun  to  see  what  is  really  left,  since 
your  loss  will  depend  upon  the  ambient  humidity,  and  how  long  your  plates  were  in  the  stacker 
-  e.g.  if  you  put  them  all  in  at  once  they’d  all  be  exposed  (with  lids  but  no  Al-foil  seals)  a  lot 
longer  than  If  you  put  them  in  one  or  two  at  a  time).  Thus,  if  1  had  a  new  plate  with  1  Oul/well 
and  used  It  for  one  bed  of  slides  that  required  two  inkings.  I  would  assume  the  new 
resuspension  volume  the  next  time  I  used  the  plate  would  be  9.0ul. 

For  the  total  volume  of  printing  buffer  required,  I  multiply  the  volume  per  well,  by  the  plate 
size,  by  the  number  of  plates.  So,  9  ul/well  *  384  wells/plate  *  1  5  plates  -  51  840  ul  - 
S 1 .84mls,  just  over  a  Falcon  tube’s  worth. 

2.  Printing  buffer: 

For  the  first  resuspension,  I  use  3X  SSC.  For  subsequent  resuspensions  (after  the  plate  has  been 
used  and  dried  down),  1  use  0.3X  SSC,  to  account  for  the  small  amount  of  salt  lost  during 
evaporation  (this  Is  what  I  learned  from  the  Schoolnik  Lab).  I  have  a  20X  SSC  stock  (Ambion). 

In  addition.  In  recent  printruns  I  have  been  adding  the  co-spot  oligo  to  the  print  buffer.  I  make 
a  1  OOpmol/ul  co-spot  oligo  stock  solution,  in  3X  or  0.3X  SSC.  Then  1  make  1  pmol/ul  working 
solution,  so  I  dilute  1 :100. 

Primary  Resuspension  Fluid  (first  time  a  plate  is  used): 

final  concentration _ recipe  per  4Qml _ 

3XSSC  6mls20XSSC 

1  pmol/ul  co-spot  oligo  400ul  of  1  nmol/ul  co-spot  oligo 

water  33.6  mis  Ambion  water 

Secondary  Resuspension  Fluid  (each  subsequent  time  a  plate  is  used): 

final  concentration _ recipe  per  4Qml _ 

0.3X  SSC  0.6mls  20X  SSC 

water  39.4  mis  Ambion  water 
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3.  Using  the  aliQuot  robot  by  Cenetix 

1 .  Sonicate  the  disassembled  manifold: 

in  3%  aQu  clean  for  >1  5  minutes 
in  MilliQ  water  for  >1 0  minutes 
in  95%  EtOH  for  >1  5  minutes 

2.  Replace  manifold  (see  diagram  in  manual  if  necessary). 

3.  Run  50mls  of  3%  bleach  (1  Chlorox  :  1  MilliQ;  Chlorox  is  6%)  through  the  robot,  letting  sit  in 
the  tubing  for  ~1  5",  per  the  online  recommendations  for  removing  DNA. 

4.  Run  3  X  50mls  of  MilliQ  through  the  system. 

5.  Rinse  with  1  x  50mls  of  80%  EtOH. 

6.  See  page  1  3  of  the  aliQuot  instruction  manual,  for  the  section  “Running  a  Filling  Routine". 

Adjust  the  manifold’s  start  position,  dispense  height,  and  tilt,  in  order  to  minimize 
splashes.  Start  position  is  adjusted  in  the  software,  the  height  and  tilt  are  adjusted 
mechanically. 

Use  a  dummy  plate  of  the  same  make  as  you’ll  be  using  to  test  the  settings  out  to  see  if 
there’s  splashage. 

PCR  mode  dispenses  volumes  as  multiple  aliquots  of  smaller  volumes,  to  decrease 
splashing.  Accessible  in  software. 

7.  Before  using  real  plates,  test  aliquotting  accuracy  in  both  of  two  ways. 

i.  weigh  a  plate  before  and  after  aliquotting  into  it. 

ii.  use  a  pipetteman  to  test  the  volume  several  wells  of  different  rows,  since  each  row  is 
filled  by  a  different  pin. 

8.  Note:  The  bottle  fill  type  Is  a  50ml  Falcon-type  tube,  BUT  Falcons,  Fisher  brand,  and  BD  brand 
do  not  work.  A  Greiner  tube  Is  in  there  currently,  and  we  have  a  limited  supply  of  other 
Greiners.  They  don’t  all  work  smoothly  either.  In  fact,  if  you  undo  the  existing  tube  it  can  be 
very  difficult  to  get  back  on.  So,  I  clean  the  tube  thoroughly  and  then  use  it  for  dispensing  if  I 
can.  Also,  I  keep  it  screwed  in  to  the  black  connector  piece,  and  unscrew  that  piece  instead 
from  the  arm,  whenever  possible. 

The  dead  volume  (volume  taken  up  from  the  Falcon  tube  before  it  gets  to  the  manifold)  of  the 
aliQuot  is  ~3mls,  as  stated  in  its  instruction  book. 

Side  note:  I  tested  evaporation  from  a  384-well  plate,  unsealed  but  with  plastic  lid  (I  think),  and 
it  ended  up  averaging  ~  0.05ul  per  well  per  hour...  with  the  caveat  that  perimeter  wells  evap. 
faster  than  internal  ones,  and  that  the  evap  process  should  be  non-linear.  This  was  calculated 
from  an  1  8-hour  benchtop  exposure  with  1  Oul  per  well  of  3X  SSC,  and  September  humidity.  (So 
the  total  loss  over  1 8  hours  was  0.9ul). 
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Instructions  for  Doing  a  Printrun  of  70-mer  Oligos  in  3X  SSC 
on  our  Genetix  QArrayZ  Arrayer 

Quick  checklist  for  starting  a  run: 

□  fill  and  start  external  humidifier 

□  turn  big  red  power  knob  on  right  side  of  machine 

□  press  the  red  reset  button  on  the  front 

□  turn  on  the  computer 

□  start  the  software 

□  check  the  water  level  in  the  nebulizer  (see  below),  and  turn  on 

□  wipe  down  inside  with  7096  EtOH,  and  spray  with  airgun 

□  fill  the  water  and  ethanol  bottles,  and  pull  them  forwards  against  the 
front  bar 

□  you  MAY  need  to  refill  these  during  the  run,  check  every  few  hoursi 

□  make  sure  the  waste  bottles  are  empty 

□  you  MAY  need  to  empty  these  during  the  run,  check  every  few  hours! 

□  check  for  pinched  tubing 

□  sonicate  &  dry  pins  (see  below) 

□  load  pins 

□  load  one  test  slide,  and  fill  remainder  of  column 

□  vacuum  on 

□  print  test  (1  slide,  2  fields) 

□  check  test  slide,  if  OK,  load  blotting  slide  and  print  slides 

□  spin  down  print  plates  and  load  into  right-hand  stacker  (recall  A1  goes 
front-right!) 

□  load  correct  protocol  In  program,  confirm  parameters  are  correct,  check 
data  tracking,  and  start 

Quick  checklist  for  ending  a  run: 

□  remove  plates,  re-seal  and  freeze  or  start  drying  in  speed-vac 

□  turn  on  light,  and  use  head  icon  to  remove  the  pins,  and  start  them 
washing 

□  turn  vacuum  back  on,  affix  labels  to  slides,  then  turn  off  vacuum,  remove 
slides  and  place  on  clean  surface  to  cut  label  strips 

□  turn  off  the  software 

□  turn  off  the  computer 

□  turn  off  the  machine 

□  turn  off  the  external  humidifier 

□  pull  the  arraying  head  into  the  middle  of  the  bed,  close  door. 


The  Physical  Set-UP: 

The  pins: 

Qfficially,  the  1  50um  tips  produce  a  ~1  90um  spot,  use  when  spotting  densities 
reach  20,000-30,000 
The  75um  tips  produce  a  90um  spot 
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Both  diameters  are  “regular  volume”  *unless*  otherwise  specified  -  both  come 
in  low  volume  options.  Regular  volume  is  approx  250nl  per  inking,  low 
volume  is  1  00-1  50nl. 

In  the  pin  boxes,  Cheryl  doesn’t  keep  the  rubber  stoppers  on  the  bottom  of  the 
pins,  she  just  pushes  them  in  gently  with  the  top  parts  flush  with  the  foam 
so  that  the  pin  tips  are  well  clear  of  the  foam.  She  also  numbers  the  pins  in 
the  foam  and  keeps  them  in  the  same  order,  for  tracking  purposes 

To  wash  the  pins  prior  to  loading  in  the  machine:  Place  them  in  the  specially- 
designed  tip-washing  manifold/stand.  Place  the  stand  into  a  beaker  with 
an  approx.  296  solution  of  aQu  clean,  or  dilute  detergent.  Place  beaker  in 
sonicating  water  bath.  Sonicate  1 0-1  5  minutes.  She  says  she  RE-USES  the 
2%  aQu  solution  several  times.  She  also  bought  just  the  smallest 
sonicating  water  bath  that  VWR  sells  and  uses  that. 

Then  rinse  in  Milli-Q  H20,  sonicating,  for  1  0-1  5  minutes. 

Then  either  air-dry,  or  dunk  in  EtOH. 

*lf  the  pin  is  persistently  clogged,  she  heats  up  the  296  soap  in  the 
microwave  and  lets  the  pins  sit  sonicating  in  the  warm  soapy  water. 

You  NEVER  should  need  to  clean  the  Head  in  the  microarrayer,  which  actually 
holds  the  pins.  It’s  made  inside  of  ball  bearings  so  don’t  *ever*  take  it  apart! 
Maybe,  if  the  pins  aren’t  sliding  in  smoothly,  use  compressed  air,  gently. 

To  load  pins:  Clicking  the  head  icon  in  the  software  will  bring  the  head  to 
the  front  left  of  the  bed  for  loading.  The  top  of  the  pins  are  not  radially 
symmetric,  they  are  rounded  and  then  have  one  straight  edge,  this  edge 
should  sit  flush  with  the  metal  bars  on  the  top  of  the  head. 

The  plates: 

Genetix  X7022,  which  has  covers  and  is  V-bottom.  She  recommends  no  lower 

than  4ul  in  them,  but  has  gone  as  low  as  2ul. 

Loading  plates:  Recall  they  go  in  REVERSE  ORDER  (from  one  prespective), 
with  the  first  plate  being  on  the  BOTTOM  of  the  “in"  stack.  Also,  they  go 
BACKWARDS,  with  the  A1  well  going  in  the  bottom  front  right!!  Plates  go 
into  the  right-hand  stacker.  Twist  the  knob  to  right  to  lock  stacks  into 
place. 

The  stackers  are  interchangeable. 

Evaporation  from  stackers.  Usually  not  a  problem,  but  one  customer  put  a 

plastic  bag  over  the  stackers  and  shoved  a  humidifier  tube  up  inside,  but  that 

was  for  a  70-plate  30,000-spot  run. 

The  machine: 
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The  tip  washing  manifold  in  the  arrayer  is  a  flow-through  design,  first  H20, 
then  EtOH.  Filled  from  containers  underneath,  and  each  into  their  own  waste 
container.  We  filled  with  Milli-Q  water  and  80%  EtOH.  The  smallest  container 
underneath  is  the  equalization  bottle  -  you  should  *never*  see  liquid  in 
there,  if  you  do  call  Tech  Support. 

Fill  the  humidifier,  from  the  capped  opening  on  the  front  right  corner  of  the 
platter,  inside  the  arrayer  hood.  Fill  with  DI-H20  or  MilliQ.  It  takes  about  two 
liters  to  fill  up,  and  use  a  funnel  to  fill  it,  and  fill  it  until  the  level  reaches  the 
bottom  of  the  plastic  aperture  through  which  you’re  filling.  There’s  a  digital 
display  on  the  front  right  of  the  machine  that  controls  the  humidity,  and 
holds  it  at  +/-  2.0%.  For  a  run  in  the  winter  set  to  55-65%.  The  ambient  can 
get  down  to  8%  in  the  winter!  If  the  humidity  is  low,  the  spots  can  bleb 
together,  and  if  it’s  really  low,  the  spotting  solution  can  dry  in  the  pins  and 
then  those  pins  won’t  print  those  spots.  Use  1-2  external  humidifiers  in  the 
room  as  well,  with  door  closed. 

**Talk  to  facilities  and  get  the  fan  turned  down  in  that  room  during  the  winter. 

On  the  START  page  of  the  software,  clicking  the  □  “Reset  Outputs  After  Run’’ 
will  stop  the  humidifier  after  the  run  is  completed,  which  is  good  so  that  it 
doesn’t  run  dry  -  especially  a  problem  in  the  winter  -  if  it  runs  dry  the 
motor  on  the  pump  can  burn  out. 

Loading  slides: 

You  have  to  fill  up  an  entire  column  of  the  bed  with  slides  to  get  a  vacuum  seal 
for  that  column  -  you  don’t  have  to  print  on  them  all  though,  you  can  just 
use  Junk  slides  to  finish  filling  a  column  in  need  be.  You  can  control  air  flow 
to  each  column  separately. 

Always  front-right  Justify  the  slides  (except  for  the  blotting  slide(s)  see  below), 
e.g.  using  forceps,  and  then  turn  on  the  vacuum  (icon  at  top  bar  of 
program). 

The  knobs  at  the  front  of  each  column  control  the  vacuum  for  that  column. 

When  the  vacuum  didn’t  come  on,  Cheryl  tried  bleeding  the  line  at  the  thingie 
on  the  right  side  of  the  machine,  lower  half,  where  there’s  a  pressure  gauge 
-  she  bled  the  line  here  (by  pushing  something  in?)  until  the  compressor 
came  on.  That  didn’t  fix  it,  but  seemed  to  be  her  first  trouble-shooting  step. 
Dave  had  to  come  and  fix  it  later. 

Cleaning: 

before  each  run  she  wipes  down  the  inside  -  walls,  everything  -  with  EtOH  and 
kimwipes.  She  also  blows  compressed  air  ‘gently*  down  the  grooves  on  the 
slide-bed,  pushing  dust  etc.  towards  the  back,  and  then  wipes  across  the 
back.  If  wiping  the  slide  ruts  themselves,  then  wipe  right-to-left,  since  the 
left-hand  side  of  the  slides  is  less  important  since  that  (based  on  our  current 
config)  is  where  the  labels  will  go. 
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The  Software  Set  up: 
you  can  store  protocols. 


WELCOME  tab  (meaningless) 

DESCRIPTION  tab:  lets  you  gave  a  name  and  write  lab  notes  for  the  printrun 
HEAD  CONFIGURATION  tab 

you  can  use  a  total  of  48  pins 

they  have  many  configs  programmed  in  the  pull-down  that  let  you  work 
from  either  96-well  or  384-weli  source  plates 

you  can  set  up  novel  configs  as  defined  in  a  config  file 
we  choose  e.g.  "16  pin  Microarraying  head"  (default  is  384-well) 
you  don’t  need  to  tell  it  what  diameter  pins,  it  doesn’t  care 
it  shows  you  the  ‘correct* * **  way  to  load  the  pins  in  the  head,  based  on  the 
config  you’ve  selected.  Follow  its  instructions! 


SOURCE  tab: 


Plate  Holder:  Stacker  Source  Plate  Holder 


use  stacker  0 


Plate  type:|  7022 


Total  Plates:  1 ,  etc.  (If  >70  it  pauses  to  let  you  refill) 


Source  order  by: 


O  columns  (means  the  head  dips  into  the  source  plate  proceeding  from 


A1-*  PI,  so  Al,  El...) 

•  rows  (means  the  head  dips  into  the  source  plate  proceeding  from  Al  — 
A24,  so  Al,  AS,  A9) 


SLIDE  DESIGN  tab: 

slide:|  3"  x  1  ”  (1 6pins/  4fields) 

you  match  this  to  the  actual  number  of  pins  being  used,  and  the  “field"  is 
the  #  of  times  that  that  printhead  could  physically  fit  onto  that  size  slide;  e.g., 
using  the  48-pin  head  config.,  it  could  only  fit  on  the  slide  once. 

_ with  a  16-pin  config  you  can  actually  print  at  a  higher  density 

field  layouL]  can  organize  replicate  fields,  etc. 


**  double-click  on  any  of  the  spots  to  open  a  new  screen,  allowing  you  to  edit 
more  parameters  of  slide  design: 


spot  view: 

O  layout  estimated  spot  size:  200  for  1  SOum  tips, 

90  for  75um  tips 

•  actual  (click  actual)  (she  nudges  this  a  little  higher  so 
there’s  good  spacing) 

Pattern  dimensions  tab: 

calculate  with  calculator. 
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384  pins  *  6  replicates  /  1 6  pins  =  1 44  spots,  =  a  1  2x  1  2  matrix 
given  the  entered  matrix,  it  automatically  gives  you  the  max.  pitch  for  the 
row  and  columns;  you  can  decrease  the  pitch  to  bring  the  spots  closer  if 
desired. 

e.g.  if  the  pitch  =300,  and  the  spot  size  =  200,  then  there’s  1  OOum 
between  spots,  which  is  fine 

Fill  tools  tab: 

Replicate  type: 

O  Adjacent 
O  Sector 
•  Cyclic 
O  Random 

Statistically,  random  really  is  the  best 

If  using  marker  spots  from  a  separate  plate,  click  on  markers,  then  hit 
remove  all,  then  highlight  the  desired  spots  and  type  M,  these  then 
become  red,  designated  as  marker  spots,  and  then  when  you  fill  in  the 
spots  it  will  assign  the  pattern  around  these  marker  spots,  leaving  room 
for  them  separately. 

SLIDE  LAYOUT  tab: 

#  slides: _ 

#  blots: _ (uses  a  sample  slide  location,  and  can  blot  multiply  on  a 

single  “blotter"  which  in  this  case  is  just  a  discard  slide.  You  can  set  the  blotting 
pitch  farther  apart  and  overprint  the  same  blotting  spots 

□  change  blotters  blotters  to  use:  □  (this  lets  you  change 

the  number  of  blotters,  it  will  pause  the  run  for  you  to  change  blotters) 

Blot  overprint  method:  □  same  sample 

□  no  overprint 


replicate  count  =  6 

where  to  start  (diagrams): 

(choose  upper  left  prob) 


^lide  order:|  (diagrammatic) 

can  drag  slide  layout  around  is  you  want  to  .  The  layout  can  become  important 
if  you  want  to  do  humidity  testing,  arrange  the  spots  to  be  printed  around  the 
edges  to  check  the  corners  of  beds. 


PRINT  tab: 


Max  stamps  per  inking:]  (approx.  200-300  in  SSC) 


#  stamps  per  spot:|  =  1 . 


You  can  change  this  if  you’re  doing  e.g.  protein  microarraying.  -  can 
overstamp,  or  stamp  in  an  offset  circle  centered  around  primary  spot,  etc. 
Stamp  tirn^  0 

She  does  0  but  you  can  increase  it  20-30  (all  in  milliseconds)  if  you’re 
seeing  non-uniform  features,  or  have  a  more  viscous  fluid. 
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Inking  time:|  =  2000  (this  is  appropriate  for  regular  volume  pins,  you  can 


decrease  the  number  for  lower  volume  pins. 


Print  depth  adjustment:!  she  adjusts  this  so  the  pins  just  barely  hit  the  slide  - 


"very’'  tough  to  see,  but  makes  a  slight  tapping  noise  during  a  run.  The  manual 
says  to  do  this  through  Datum  Points,  and  then  use  the  print  depth  adjustment 
to  Just  vary  if  there’s  a  known  alternate  slide  type  you  use,  etc. 


TOUCH  OFF  tab: 

won't  need  this  for  SSC.  It’s  used  for  more  viscous  buffers,  to  wick  off  the 
excess  solution  from  the  "outside*  of  the  pins,  by  touching  the  surface  of  the 
liquid  in  the  plate. 

STERILIZE  tab; 


You  can  bring  the  EtOH  up  to  3500ms  and  also  up  the  water  times  if  you’re 
starting  to  worry  about  cross-contam,  but  these  are  the  params.  she  likes 

(there’s  two  diff  washing  philosophies,  one  as  above,  the  other  with  longer 
washes,  eg  1  water  wash  at  5000ms,  same  EtOH.) 


Water!  ”  4  each  at  1 000ms  wash  time,  0  dry  time,  500  wait  time 


EtOH  -  1  -  at  2500ms  wash  time,  7000  dry  time,  500  wait  time 


DATA  TRACKING  tab: 

First,  on  the  desktop,  change  the  comma-delimited  .csv  file  from  Excel  to 
.txt,  then  open  the  separate  Data  Tracking  Program.  The  username;  dtuser, 
password;  dtuserpw 

admin  username:  sa,  pword:  genetixsapw 
Tools  —  Import  Process  File,  Files  of  type  Qsoft...  etc. 

Close  Data  Tracking  Program,  and  back  in  Qsoft  under  Data  Tracking  tab; 

File  name:  (for  gal  file)  File  format:  gal4.1  works 
Open  Croups  -»  OligoPrototypeArray  (whatever  group  you  want,  you’ve  loaded) 
highlight  the  plates  wanted,  and  click  Add. 
the  TOP  plate  in  the  software  =  the  first  plate  out  of  stacker  =  the  BOTTOM 
plate  in  stacker 


Always  backup  the  Database  before  reading  new  (e.g.  rubbish)  data  in! 
BARCODINC  tab:  N/A 


START  tab: 

O  Normal 

O  Test  plate  (inks  ONLY  from  the  first  1 6-pin  quadrant.  Just  to  print  a  few 

slides) 

O  Print  Test  (inks  and  then  prints  onto  Just  the  front  left  slide,  without  re¬ 
inking.  Let’s  you  see  if  all  the  pins  are  clean,  see  the  numbers  of  times  you  can 
print  from  a  given  inking,  and  the  #  of  times  you  might  need  to  blot,  etc.) 
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O  Data  tracking  export  only  (just  generates  the  gal  file  for  a  given  data 
tracking  import) 


Other  Important  thing  on  the  Software: 


If  you  ever  want  to  just  get  the  gal  file  made  for  a  given  run,  or  made  alone, 
you  find  it  through:  My  Computer  -*  C  -*  Program  Files  -*  Cenetix  —  QArray  — 
logs  -*  gal 


IConfiguration  Icon]  on  Toolbar 
(double  click) 

scroll  down: 

offset  field  1 X 


offset  field  1  y 


(offset  field  x,  y 


-*  Defined  Objects  -»  3x1  (16pins/4  fields) 


this  controls  how  far  from  the  skinny  “top"  edge 
the  field  starts.  She  changed  it  from  1  000  to 
1  500,  for  the  purposes  of  the  print  test, 
this  controls  how  far  from  the  long  “side”  edge 
the  field  starts.  She  changed  it  from  3900  to 
41  00,  eventually  to  4300,  to  get  the  field  more 
centered  on  the  slide. 

Just  control  the  distance  between  fields) 


Inking  depth:]  This  is  how  deep  into  a  plate  you  go.  She  has  it  set  for  their 


plates,  and  set  for  4-1  Oul  volume.  If  we  go  above  1 0,  to  1  5  or  so,  it’s  worth 
changing  the  defined  object  inking  depth  so  we  waste  less  oligo  on  the  outside 
of  the  pin.  Currently  it’s  set  for  about  200um  above  the  bottom  of  the  plate. 
You  can  change  the  inking  depth  for  a  new  defined  object,  and  save,  or  you  can 
change  it  within  a  given  run  only. 


Print  depth^  This  needs  to  be  done  for  particular  slides.  Co  to  IConfigut^ 


Datum  Points!  in  the  toolbar,  and  set  the  heights  for  the  slides  in  each  column 
empirically.  To  adjust  the  print  depth,  negative  numbers  =  up  higher. 


She  saved  a  protocol  called  "Print  Test"  which  prints  two  fields,  and  prints  one 
slide  only.  She  changed  the  field  layout  for  it. 


icon  at  top  =  Diagnostic  button,  lets  you  move  things 
drag  the  head  where  you  want  it,  make  a  wash 


Hammer  and  Screwdriver 
around,  etc.  You  can  Just 
happen,  etc. 


Slide  layout  for  first  run  was: 
sample  slides:  1 4 

#  blots:  5 
blot  pitch:  300 

#  overprints:  1  BUT  check  off  No  overprint,  below. 
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Water  =  2000  wash  time,  4  replicates,  0  dry  time,  wait  500 
EtOH  =  4000  wash  time,  once,  1  0000  dry  time,  wait  500 

save  routine  icon  at  top,  looks  like  save  as  (double  disk) 


To  back-up  datum  points  and  protocol 

Robot  config  -»  Database  -*  Back-up  (on  C  drive,  -»  Qsoft  back-up) 


Maintenance  and  Troubleshooting: 

If  the  arrayer  is  not  going  to  be  used  for  >=  week  or  so,  she  says  to  empty  the 
bottles  and  leave  everything  dry,  so  that  nothing  grows.  She  also  says  you  may 
want/need  to  wash  the  tubing  once  in  a  while:  says  some  customers  have  run  a 
mild  fungicide  through  and  then  flushed  with  water  -  this  is  under  I/O 
diagnostic  section,  Microarraying  Wash,  cick  on  Port  1  (=wash).  Switch,  and 
Vacuum,  and  then  the  wash  runs  continuously  through  both  lines  until  you  stop 
it. 

To  test  if  the  pins  are  drying  post-wash,  and  if  they’re  washing  enough  or  if 
there’s  carryover,  cut  a  small  piece  of  e.g.  nylon  and  tape  it  on  a  slide.  Then 
print  (print  test)  on  it  and  see  if  it’s  wicking  off  moisture,  and  if  it’s  colored  (use 
a  dilute  blue  food  coloring  in  your  spotting  buffer  to  see  if  there’s  carryover). 
Use  an  empty  source  plate  if  you’re  just  trying  to  see  if  the  dry  time  is  sufficient 

-  it  needs  a  plate  in  there  to  go  through  the  printing  motions. 

Blotting: 

5-10  blots  OK  for  3xSSC 

she  set  for  5,  all  on  1  slide  (program  figures  how  many  slides  automatically) 
the  blot  slide  doesn’t  stick  to  field  size  but  starts  at  far  left  hand  side  -  for  this 
reason,  this  slide  should  NOT  be  bottom-right  Justified  but  be  a  littie  more 
centered! 

Troubleshooting:  back  pins  not  printing,  tho  freshly  cleaned.  Tried  washing  tray 
again,  still  no.  Switched  pins  front-to-back,  and  back  half  still  not  printing  -  so 
pins  are  OK.  Checked  balance  of  head  and  bed  with  a  mini-balance.  Turns  out 
head  itself  is  not  level. 

Cheryl’s  free  advice  on  plate-sealers  (any  gunk  left  on  top  may  prevent  top  from 
coming  of  plate  cleanly,  and  screw  up  the  machine): 

-  small  hand-held  heat  sealers  units  (for  sealing  foil)  from  Marsh  (now  bought 
by  Fisher) 

-  or  HyperTask,  small  local  company,  have  foil  plate-sealers  she’s  liked  and  not 
had  probs  with. 
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U.K.  Tech  Support  (US  toll  free  #): 

1-877-436-3275 

favorite  dude;  Tim  Roberts,  ext.  4796,  expert  on  QArray2 
tim.roberts@qenetix.com 

U.S.  Tech  Support: 

1-877-436-3849 

spoke  previously  to  Joe  Jordan  there. 

Another  local  tech/rep  person  is  Ken  Adams,  61  7-549-6050 
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Array  Post-processing  Protocol 

Adapted  from  DeRisi,  Schoolnik,  and  Somero  Lab  protocols  (developed  by  Andy  Cracey), 
DeRisi  microarray  course  notes,  and  personal  experience. 


Goal:  After  the  arrays  have  been  printed  with  DNA  spots,  this  protocol  is  used 
to  bind  the  DNA  more  tightly  to  the  slide,  remove  excess  DNA  that  hasn’t  been 
bound,  and  block  any  free  lysines  on  the  poly-lysine  slide  coating  (those  free 
lysines  are  “sticky”  and  if  not  blocked  they  could  non-specifically  bind  labeled 
probe  during  an  array  hybridization.)  Since  slides  age  more  quickly  after  they’ve 
been  post-processed,  it  makes  sense  to  just  post-process  a  small  batch  at  a 
time,  as  they  are  needed. 


Required  Equipment  most  of  it  in  my  bench  cupboards 
slide  rack(5) 

a  1 L  and  a  50ml  Erlenmeyer  flask 
2  3L  beakers 
50ml  Erlenmeyer 
rotator-table 

humid  chamber  (Sigma  H6644) 
heating  block 

dust-free  board  for  cross-linking  (e.g. 


piece  of  cardboard) 
centrifuge  with  adapter  for  slide  rack 
diamond  scribe  etcher  (in  my  bench 
drawer) 

powder-free  gloves 

UV-crosslinker  (Stratagene  above  Steve’s 
bench) 


Required  Chemicals 
20X  SSC,  1  096  SDS 

1 M  Sodium  Borate,  filtered  (make  from  Boric  acid,  adjust  pH  with  NaOH):  Boric  Acid’s  FW  = 
61.83 

1  -Methyl-2-pyrrolidinone  (anhydrous,  99. S%,  FW*=99.1  3)  -  Sigma  M6762-1  L  -  do  not  use  if  it 
appears  yellowish 

Succinic  Anhydride,  6g  (99-J‘%,  FW=1  00.07,  a  moisture  sensitive  irritant)  -  Aldrich  239690- 
SOG  -  do  not  use  if  it  has  been  exposed  to  moisture.  Keep  in  a  dessicator,  sealed 
with  parafilm. 

9S96  Ethanol  (do  not  make  from  100%,  it  is  made  differently)  -  do  not  use  if  it  is  cloudy 
or  has  particulates 

Before  you  start:  prep  work  can  take  up  to  -  an  hour 

**  always  rinse  all  glassware,  etc.  with  Milli-Q  water  before  use  to  remove  dust** 

**  always  use  powder-free  gloves  when  working  with  arrays!** 

1)  take  Methyl-pyrrolidinone  out  of  fridge  and  put  in  hood  to  come  to  room 
temperature. 

2)  make  up  *fresh*  Na-Borate  (do  this  first  so  it  has  time  to  cool  before 
pHing) 

you'll  need  25.71  mis,  so  make  e.g.  40mls  and  then  discard  excess: 
in  a  50ml  Erlenmeyer,  with  a  small  stir  bar: 

35mls  Milli-Q  H20 
2.473g  Boric  Acid 

mix  on  high,  with  heat,  until  dissolved 
cool  to  room  temp 

adjust  to  pH  =  8.0  with  1  ON  NaOH  (takes  >20  drops,  so  start  there. 
Do  NOT  work  back  with  HCI  if  you  go  past.  Start  fresh  with  new  solution.) 
check  volume  -  add  Milli-Q  H20  up  to  40mls 
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filter  through  0.2um  syringe-filter,  into  a  50ml  Falcon  tube 

3)  wipe  down  bench  area  with  EtOH 

4)  heat  up  heat  block  on  bench  to  at  least  90*C 

5)  place  in  the  chem.  hood: 

•  stirring  plate  for  preparation  of  blocking  solution 
■  shaker,  with  secondary  containment  tray  taped  down 

6)  make  up  pre-wash  solution:  1 X  SSC/  0. 1  %  SDS 

for  ZOOmIs  (to  use  for  1  rack  of  slides,  in  a  3L  beaker): 

658mls  Milli-QH20 
35mls  of  20X  SSC 
Zmis  of  10%  SDS 

mix  briefly  with  stir  bar,  remove  stir  bar,  and  cover  beaker. 

Step  1 .  Rehvdrate  the  slides 

This  step  recovers  the  spot’s  circularity,  and  decreases  ’’donating”  of  the  spots. 

a)  wipe  down  the  bench  with  70%  EtOH  and  kimwipe  {not  paper  towel)  to  remove 
dust 

b)  put  warm  tap  water  into  humid  chamber,  and  place  chamber  next  to  heat  block 

c)  Rehydrate  slides  by  inverting  (array  side  down  toward  steamy  water)  them  over 
warm  water  in  a  slide-staining  chamber  -  don’t  use  much  water  or  it  can  splash 
up  on  slides.  Watch  until  the  spots  become  glistening  and  juicy,  3-10  minutes. 
(Under-hydration  causes  too  little  DNA  to  stay  adhered  to  the  spot  during  the 
subsequent  washing,  and  over-hydration  causes  the  spots  to  be  blebby.  AC  says 
2-3”  is  usually  sufficient,  the  DeRisi  protocol  says  1  -10”.) 

Be  careful  not  to  allow  the  water  to  touch  the  array. 

d)  Immediately  flip  them  (array  side  up!)  onto  a  heating  block  (inverted,  about 
90‘C).  Watch  the  steam  evaporate.  When  the  array  spots  dry  in  a  rapid  wave-like 
pattern,  remove  them  from  the  heating  block.  This  takes  about  5  seconds.  Do  1 
slide  at  a  time. 

Step  2.  UV  cross  link 

This  step  helps  the  DNA  stick  to  the  poly-lysine. 

a)  Place  the  slides,  array  side  up,  on  a  flat,  dust-free  board  that  fits  into  the  UV 
cross-linker  (I  use  a  pre-cut  piece  of  cardboard  that  I  keep  in  my  cupboard). 

Do  not  put  them  on  a  seran  wrap  surface  since  the  slides  stick 

to  it. 

b)  Irradiate  with  600  ujoules  UV  light  -  press  the  “ENERGY”  button  and  then  enter 
600,  then  press  “Start”.  It  will  count  down  and  beep  when  done.  (Andy  does  his  at 
650  and  does  them  in  the  metal  rack,  not  laying  them  flat.) 

c)  Before  the  next  step,  etch  the  slides  with  a  diamond  scribe  (in  the  top  middle 
drawer  of  my  bench)  to  demarcate  where  the  array  is  -  after  the  pre-wash  the 
spots  will  become  invisible! 

Step  3.  Pre-Wash  (a.k.a.  the  “shampoo  method”) 

This  step  removes  excess,  unbound  DNA  to  prevent  ’’pluming”  of  the  DNA  out  from 
the  spots.  Some  protocols  suggest  skipping  this  step  if  the  Initial  spots  are  small,  however 
skipping  this  would  then  require  an  extra  step  later,  which  is  not  in  this  protocol. 
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a)  Place  the  slides  in  the  slide  rack  and  secure  them  with  a  strip  of  metal  wire  on 
top,  or  rubber  bands.  If  rubber  bands,  then  position  as  close  to  the  rack  edge  as 
possible.  (AC  uses  rubber  bands.) 

b)  rapidly  plunge  slides  into  lx  SSC  /  0.1%SDS  for  30  sec 

c)  Gently  wash  slides  with  lots  of  Milli-Q  water,  swish  rack  back  and  forth 
(something  like  5  consecutive  gentle  rinses  of  30+  seconds  each).  Let  sit  in  water 
briefly  while  preparing  blocking  solution.  (Pre-wash  protocols  will  vary  on  the 
details,  but  all  have  the  same  general  principle.  Some  also  spin  slides  dry  before 
blocking,  but  AC  doesn’t,  and  drying  doesn’t  appear  necessary  or  beneficial.) 

Step  4.  Block  free  Ivsine 

wear  a  lab  coat  when  working  with  methyl  pyrrolidinone 

a)  In  a  1  L  beaker,  add  8.643g  of  succinic  anhydride  into  526ml  1  -Methyl-2- 
pyrrolidinone  while  stirring  with  stir  bar  on  stir  plate. 

b)  As  soon  as  the  solids  dissolve,  (though  they  may  not  dissolve  completely,  some 
protocols  warn  -  mine  always  has  dissolved),  quickly  add  23.57ml  of  1  M  Na- 
Borate  pH  8,  and  pour  the  mixed  solution  into  a  3L  beaker. 

c)  Quickly  place  the  slides  (in  metal  rack)  into  the  succinic  anhydride  solution  (do 
not  pour  the  solution  over  the  slides)  and  plunge  up  and  down  for  60  seconds. 
Rotate  at  60  rpm  for  Ihour  if  possible  -  as  little  as  30  minutes  probably  OK.  (AC’s 
surface  chemist  friends  say  the  process  doesn’t  go  to  completion  for  about  an 
hour,  though  some  protocols  call  for  as  little  as  1  5”.)  While  blocking,  set  up  3L 
beaker  with  Milli-Q  water  on  hotplate,  in  hood,  and  heat  to  boiling. 

d)  Remove  the  slide  rack  from  the  organic  reaction  mixture  and  place  it 
immediately  into  the  boiling  Milli-Q  water  bath  (some  protocols  call  for  room 
temp,  feel  free  to  try  this  out  and  see  which  works  better,  just  make  a  note  of  it!  I 
haven’t  noticed  a  difference,  actually)  and  wash  thoroughly  but  gently  by 
swishing  rack  back  and  forth  for  90  seconds. 

e)  Transfer  the  slide  rack  to  a  3L  beaker  containing  approximately  575mls  of  95% 
ethanol  (do  not  make  from  100%),  plunge  slides  to  mix,  and  then  carry  directly 
in  EtOH  to  the  tabletop  centrifuge. 

f)  Spin  dry  the  slides  by  centrifugation  at  1  50  x  g  for  2  min.  Use  a  counter  balance 
with  the  same  number  (&  orientation)  of  slides  in  a  rack.  (Balance  slides  are  in  the 
top  right  drawer  of  my  bench). 

g)  Carefully  transfer  the  slides  to  a  dry  slide  box  for  storage  in  a  dessicator.  Make 
sure  the  slide  box  is  appropriately  labeled. 

h)  Collect  methyl-pyrrolidinone  solution  as  waste  for  periodic  EHS  pick-up. 
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Array  Post- Processing  Ponn: 

Date: _  Person: _ 

Which  slides  being  post- processed  (printdatc,  series): _ 

Total  number  = _ 

Notes  on  solution  making: 

Notes  on  re-hydraiion: 


Notes  on  pre-wash: 


Time  in  to  blocking  solution: _ Time  out: _ 

When  out  of  blocking  solution,  into  HO  I  or  COLD  water  bath?  (circle  one) 
Notes  on  blocking  steps: 
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Amplification  and  Labeling  of  DNA  for  Microarray  Hybridization, 
using  the  “Round  A/B/C”  Random  DNA  Amplification  Protocol 


Goal:  To  amplify  and  label  DNA  prior  to  hybridization  to  a  microarray,  in  a  relatively  random 
way.  This  protocol  does  not  give  linear  amplification,  but  as  DeRisi  says  it  “is  useful  to  compare 
relative  enrichment  between  two  samples.”  DeRisi  reports  that  it  has  been  successful  in  their 
hands  for  amplifying  less  than  1  ng  of  genomic  DNA.  I  have  obtained  results  from  as  little  as 
several  hundred  picograms  of  environmental  DNA  but  a  safer  lower  limit  starting  amount  of 
DNA  seems  to  be  -6ng  per  slide  hyb  (see  caveats  below). 

Protocol  History:  I  adapted  this  protocol  from  that  used  byjoseph  DeRisi’s  Lab  at  U.C.  San 
Francisco,  and  theirs  was  adapted  from  Bohlander  et  al.  Genomics  1  3  (1 992). 

Overview:  There  are  three  stages  of  this  protocol.  In  Round  A,  the  Sequenase  polymerase 
extends  random  primers  with  specific  ends  (Primer  A)  that  have  annealed  to  the  template  DNA. 
In  Round  B,  conventional  PCR  amplifies  the  templates  from  Round  A,  using  the  specific  primer 
(Primer  B)  which  matches  the  3’  end  of  Primer  A.  In  Round  C,  Primer  B  is  used  again  to  mediate 
rounds  of  conventional  PCR  during  which  modified  nucleotides  are  incorporated  for  labeling. 
These  modified  nucleotides  are  typically  either  amino-allyl-dUTP,  for  indirect  labeling,  or 
nucleotides  that  are  directly  coupled  to  Cy  dyes.  I  use  the  aa-dUTP  indirect  labeling  so  that  is 
what  is  described  here.  There  is  less  discrimination  by  the  polymerase  against  the  smaller  aa 
dUTPs  than  against  large,  bulky  Cy-dNTPs. 

Understanding  the  Round  A/B/C 
Random  Amplification  and  Labeling  Protocol 
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Precautions:  This  is  a  random-amplification  protocol,  which  means  that  ANY  DNA  can  be 
amplified.  Therefore,  use  filter  tips,  wear  gloves,  and  UV-sterilize  your  tubes  along  the  way. 
Also,  run  negative  control  reactions.  Also,  because  it’s  a  random-primed  multi-round 
exponential  amplification,  we  might  worry  about  stochastic  skewing  of  the  relative  abundance 
of  different  organisms  during  amplification.  The  protocol  partly  accounts  for  this  by 
subsampling  each  amplification  as  the  template  for  the  next  round  (shown  to  be  beneficial  to 
PCR  evenness  generally  in  Thompson  et  al.,  2004).  In  addition,  I  choose  to  run  triplicate 
amplification  reactions  and  pool  them  prior  to  labeling.  Pooling  multiple  reactions  has  also 
been  shown  to  decrease  random  biases  introduced  early  during  amplification  (Thompson  etal., 
2004). 
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Item 

SuDDlier 

Item  # 

Ceneral: 

Nuclease-Free  Water 

Applied  Biosystems/Ambion 

#AM9937 

Positive  control  DNA 

In  my  case,  Halobacterium 

ATCC 

#700922D 

Round  A 

Seguenase  (13  units/^il) 

US  Biochemical 

#70775 

5X  Seguenase  Buffer 

included 

Seguenase  Dilution  Buffer 

included 

“Sequenase  Version  2.0  DNA  polymerase  Is  a  genetically  engineered  form  ofT7  DNA  polymerase  which  retains 
extraordinary  polymerase  activity  with  virtually  no  3’->5’  exonuclease  activity.  It  is  highly  processive,  able  to  effectively 
Incorporate  nucleotide  analogs  for  sequencing  (dideoxy  NTPs,  -thio  dATP.  diTP.  7-deaza  dCTP.  etc.)  and  Is  not  easily 
impeded  by  template  secondary  structure." 

“A”  dNTPs  =  3  mM  each  nucleotide 
500  ^.ig/ml  BSA 
0.1  M  DTT 

40  pmol/fil  Primer  A:  e.g.  Proligo  N/A 

5’-Gl  I  ICC  CAG  ICACGA  ICN  NNN  NNN  NN  3’ 

Round  B 

lOX  Mg-minus  PCR  Buffer  to  match  the  Tag 

(500  mM  KCI,  100  mM  Tris  pH  8.3) 

25  or  50  mM  MgCI2 

"B”  dNTPs  «  25  mM  each  nucleotide 

5  U/nl  Tag  polymerase  e  g.  Invitrogen 

1 00  pmol/^l  Primer  B:  e.g.  Proligo  N/A 

5’  •  G  I  I  ICG  CAG  I  CA  CGA  I  C  -  3’ 

Round  C 

Same  as  Round  B  except  use  modified  “C”  dNTP  mix 
Their  recommended  recipe  is; 

25  mM  each  dATP.  dCTP  and  dCTP 
10  mM  dTTP 

1  5  mM  aminoallyl-dUTP  (or  Cy-dUTP)  Ambion  #AM8439 

However,  they  suggest  that  the  ratio  of  aa-dUTP  to  dTTP  can  be  altered/optimized  My 
optimized  recipe  is: 

22.5  mM  each  dATP.  dCTP  and  dCTP 
9  mM  dTTP 

1 1 .75  mM  aminoallyl-dUTP 
For  100  ^il  this  corresponds  to: 

22.5  fil  lOOmM  dATP,  dCTP,  and  dCTP 
9ril  lOOmM  dTTP 

23.5  ril  50mM  aa-dUTP 


aa  dUTP  structure  (for  L-A  ©),  and  the  general  amine-ester  reaction  employed  for  dye  coupling; 


Succinim  dyl  ester 


Cartxnan'  de 
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Protocol 

Round  A:  Denature  template  DNA,  anneal  primers,  and  extend. 

I  imc  At  least  I  hoar  JO  minutes-,  inireasinn  signifu  antty  based  nn  number  of  leui  Hiins 


Round  A.  Step  1 :  First  strand  synthesis. 

Each  reaction  receives: 

Inaredient _ Volume 

Template  DNA  7  jil 

(e.g.  6  ^1  template  DNA  and  1^1  positive  control  DNA) 
5X  SequenasG  Buffer  2  |il 

Primer  A  (40  omol/^l) _ LyJ 


Total  Volume  =10^1 

To  standardize  things  I  prepare  a  master  mix  of  5X  Sequenase  buffer  and  Primer  A,  and  then 
dispense  it  into  my  tubes  -  either  0.2ml  PCR  tubes  or  a  PCR  plate.  For  triplicate  reactions,  I 
dispense  3X  of  this  master  mix  into  the  wells,  add  3X  of  my  DNA,  mix,  and  then  aliquot  Into 
three  separate  tubes,  or  three  separate  rows  If  using  a  PCR  plate.  This  works  well. 

Use  “vr-a"  cycling  protocol  on  "Goldie"  thermal  cycler**; 

Heat  2  min  at  94  C 

Rapid  cool  to  1  0  C  and  hold  5  mm  at  1  0  C. 

/  use  this  thermai  cycle  because  I  have  programmed  it  to  have  approximately  the  correct  ramp  time  for  later  steps. 
Other  thermal  cyclers  have  different  ramping  speeds  and  so  will  need  to  be  programmed  accordingly. 


With  program  paused  at  10  C  and  the  tubes  in  the  thermal  cycler,  add  5.05  fil  Reaction 
Mixture  to  each  reaction  (having  assembled  reaction  mixture  in  UV*hood). 


Inaredient 

Volume 

5X  Sequenase  Buffer 

1  Ml 

“A"dNTPs  (3mM) 

1.5  |il 

0.1  M  DTT 

0.75  jil 

500  ^ig/ml  BSA 

1.5  Ml 

Seauenase  (1  3U/ul) 

QJ nl 

Total  Volume  *  5.05/// 


Again,  I  make  a  master  mix  of  this  reaction  mixture,  in  the  UV-hood,  and  then  dispense  it  at  the 
thermal  cycler  into  each  tube  or  well. 

Ramp  from  10  C  to  37  C  over  8  min. 

Hold  at  37  C  for  8  min 

Rapid  ramp  to  94  C  and  hold  for  2  rniri. 

Rapid  ramp  to  1  0  C  and  hold  for  5  min 

Round  A.  Step  2:  Second  strand  synthesis 

Pause  at  1  0  C  while  adding  1 .2  |il  of  diluted  Sequenase  (1 :4  dilution  in  Sequenase  Dilution 
Buffer). 


Ramp  from  10  C  to  37  C  over  8  min. 

Hold  at  37  C  for  8  mm 
END 

In  PCR  hood,  dilute  samples  with  Amblon  Water  to  final  volume  *  60  |il  (should  be  60  -  1 0  - 
5.05  -  1.2  -  43.75UI). 
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Round  B:  PCR  amplification. 


Mix  in  a  0.2ml  PCR  tube,  in  the  UV-hood. 


Ingredient _ Volume 

Round  A  Template  6^1 

50mM  MgCI2  4  ^1 

1  OX  Mg-minus  PCR  Buffer  10^1 

“B”  dNTPs  (25mM)  1  ^1 

Primer  B  (1  OOpmol/nO  1^1 

Taq  l^il 

Ambion  Water _ 77ul 


Total  Volume  =  1 00  pit 

Use  "vr  b”  cycling  protocol  on  “Goldie”  thermal  cycler: 

30  sec  at  94  C 
30  sec  at  40  C 
30  sec  at  50  C 
2  min  at  72  C 

Run  1  5-35  cycles,  depending  on  the  amount  of  starting  material.  I  typically  use  20  cycles. 

If  you  run  5  pi  of  each  reaction  product  on  a  1  %  agarose  gel,  you  should  see  a  smear  of  DNA 
between  500bp  -1  kb.  To  minimize  the  number  of  cycles  you  run,  the  first  time  you’re  working 
with  a  new  type  of  template  they  recommend  removing  aliquots  of  your  reaction  (of  which  you 
have  extra  to  spare,  don’t  worry)  every  2  cycles  or  so  and  checking  them  on  a  gel  -  you  want  to 
use  the  minimum  number  of  cycles  that  produces  a  visible  smear  of  product  DNA,  and  that  still 
keeps  the  negative  control  lanes  empty. 


Round  C:  Incorporation  of  aa-dUTP. 

'  .h'pi'nilifhi 

They  recommend  using  10-1  5  pi  of  Round  B  to  seed  the  Round  C  reaction.  I  use  10  pi. 
Ingredient _ Volume 


Round  B  Template 

10  pi 

50mM  MgCI2 

4  pi 

1  OX  PCR  Buffer 

10  pi 

“C"  dNTP  mix 

1  pi 

Primer  B  (lOOpmol/pl) 

1  pi 

Taq 

1  pi 

Water 

73  ul 

Total  =  7  OOul 


Use  “vr  c”  cycling  protocol  on  "Goldie"  thermal  cycler 
30  see  at  94  C 
30  sec  at  40  C 
30  sec  at  50  C 
2  rnin  at  72  C 

1  0-25  cycles  can  be  run,  I  typically  run  1 0  cycles. 
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Clean-up  I: 

Salts  and  Tris  interfere  with  dye  coupling,  so  before  proceeding  you  must  clean  up  the 
reactions.  They  recommend  using  a  Microcon  size-exclusion  column  to  do  this. 

Add  400  fil  water  to  the  sample  in  a  Microcon  30 

Spin  M.OOOxg  until  liquid  mostly  drained.  Empty  collection  tube. 

Wash  1 X  with  SOO  ii\  Ambion  water. 

Concentrate  to  -^9.3  n\  in  Ambion  water:  8  ^1  will  be  used  for  the  labeling  reaction,  and  --1  ^1  will 
be  used  to  Nanodrop  the  sample.  I  record  the  volume  1  was  actually  able  to  concentrate  the 
sample  to  (it’s  tricky,  and  I  can’t  usually  get  to  9.3  exactly)  so  that  I  can  calculate  my 
amplification  efficiency  if  I  want  to,  and  also  understand  the  comparability  of  my  samples. 
Obviously,  one  wants  the  volumes  to  be  as  close  as  possible  to  one  another  between  samples 
to  permit  the  most  comparability. 

Also,  I  combine  my  triplicate  reactions  at  this  stage,  pre-labeling.  You  can  combine  triplicates 
prior  to  Microconning  but  beware  that  if  you  pool  the  negative  controls  before  you  clean  them 
they  may  be  VERY  slow  to  drain. 

If  doing  lots  of  samples,  instead  of  using  Microcons,  I  use  an  ExcelaPure  96-well  size-exclusion- 
column  plate  with  the  vacuum  manifold.  I  run  it  at  10"Hg  so  as  not  to  lose  DNA  <300bp.  Note 
that  this  size  exclusion  cutoff  is  a  little  bigger  than  the  Microcon-30's.  50,  for  any  experiment 
or  for  experimental  series  you’d  like  to  be  able  to  compare,  it  would  be  advisable  to 
consistently  use  one  or  the  other  clean  up  method. 

Wash  1  X  300  n\  of  Ambion  water 

Resuspend  in  -30  Ambion  water,  transfer  to  a  v-bottom  96  or  384-well  plate. 

Use  the  vacuum  centrifuge  with  the  plate  rotor  to  dry  down  the  DNA.  Use  e.g.  the 
automatic  spin  with  2  hours  vacuum  spin,  1  hour  at  45  deg.  C.  Then  resuspend  your  DNAs 
directly  in  0.1  M  NaHC03  (allowing  to  sit  at  e.g.  60  deg  C  for  10”,  then  vortex  gently  and  spin). 


Labeling: 

8  fil  aa-DNA 

2  fil  0.5M  NaHC03  (0.1  M  final  concentration  in  the  DNA  mixture) 
mix 

OR  1  0  ril  aa-DNA  resuspended  in  0.1  M  NaHC03 
5  pi  Cy  dye  (3  3  ^g  in  DMSO) 
mix 

incubate  at  room  temperature  in  the  dark  for  1  hour. 

Co-spot  complement:  If  you  are  using  the  co-spot  complement  as  well,  you  will  have  done  a 
single  separate  labeling  reaction  of  that,  linked  to  Cy5.  I’ve  found  that  using  -1  pmol  of  the  co¬ 
spot  complement  oligo  per  array  hybridization  works  fine.  In  my  case,  the  co-spot  complement 
that  performed  the  best  and  that  I  ended  up  using  was  the  “alien"  complement  oligo  from 
Urisman  et  ai,  2005. 

Quenching:  If  using  the  co  spot  complement,  you’ll  combine  the  differently  labeled  DNAs  at  the 
hybridization  stage.  You  wouldn’t  want  any  residual  uncoupled  dyes  to  cross-label  the  wrong 
DNA.  Although  rinsing  with  TE  will  quench  the  labeling  reaction,  and  should  remove  uncoupled 
dyes,  for  best  practices  you  should  AL50  use  the  traditional  chemical  quenching  protocol  step 
of  adding  2  [i\  of  4M  hydroxylamine  to  each  reaction,  mixing,  and  allowing  them  to  sit  in  the 
dark  an  additional  1  5  minutes. 
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Clean  up  II: 

Now  you  want  to  remove  unincorporated  dye  molecules.  Use  the  single-column  or  96-well  size 
exclusion  column  plates,  as  before.  Now,  however,  wash  with  TE.  The  TE  helps  inactivate  the 
dye  conjugation  process. 

Add  480  nl  (or  280  ^1  if  using  Excela-Pure  plates)  to  your  samples. 

Transfer  to  the  columns. 

Spin. 

Wash  columns  2x  500  ^il  TE  (or  2x  300  fil  if  using  Excela-Pure) 

Concentrate  to  9  \i\  in  TE,  or  more  if  you’re  doing  triplicate  slides. 

Note:  do  not  use  the  Excela-Pure  plate  to  clean  the  co-spot  complement.  This  oligo  Is  smaller 
than  the  cutoff  of  the  Excela-pure  columns. 
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Hybridizing  Cy-labeled  DNA  to  homemade  PLL  oligo  microarrays 

Protocol  History;  I  adapted  this  protocol  from  that  used  byjoseph  DeRisi’s  Lab  at  U.C.  San 
Francisco. 

Materials 


Supplies: 

0.5ml  tubes 

Lifterslips,  Erie#22x40l-M-55 1  6,  available  as  VWR  #48382-242 

Heat  block  with  O.SmI-tube  block 

Hyb  ovens  with  accurate  and  precise  temperate 

Monitoring  digital  thermometer 

Hyb  chambers  (e.g.  from  Cenetix) 

Microcentrifuge 

Centrifuge  with  plate-spinning  rotor 
Slide  racks 


Reagents: 

2  OX  SSC 

HEPES,  make  to  IM.  pH7 
Ambion  H20 
polyA,  make  lOmg/ml 
1  0%  SDS 


Applied  Biosystems/Ambion 
Sigma 

Applied  Biosystems/Ambion 
SIgma-Aldrich 

Applied  Biosystems/Ambion 


#  AM9770 
#H4034-25C 

#  AM9937 
#P9403-25MC 

#  AM9822 


Protocol 


The  total  volume  of  your  hybridization  reaction  will  depend  on  the  size  of  your  lifterslip.  I  am 
currently  using  mid-sized  lifterslips  with  a  recommended  volume  of  29|il;  I  use  30)^1.  For  the 
prototype  array.  I  used  smaller  lifterslips  for  which  my  hybridization  volume  was  25^1. 


For  one  reaction; 

For  3.1  reactions: 

For  all  reactions: 

Multiply  column  1  values  by  3. 1  if  you  are 
hybridizing  triplicate  arrays  for  each 
sample. 

Multiply  coluryui  1  or  2  values 
(depending  on  if  doing  triplicate  or 
single  arrays)  by  10%  of  the 
number  of  samples  you've  got. 

DNA: 

DNA: 

19.83nl  Cy3-DNA 

61.47mI  Cy3-DNA 

lul  CyS-cospot-complement 

3.1ul  CyS-cospot-complement 

Mix  HI: 

Mix  HI: 

Mix  HI: 

4.49nl  20X  SSC 

13.92nl  20X  SSC 

2  OX  SSC 

0.62nl  IM  HEPES,  pH  7.0 

1.92nl  IM  HEPES.  pH  7.0 

IM  HEPES.  pH  7.0 

2.24ul  Ambion  H20 

6.94nl  Ambion  H20 

Ambion  H20 

5:=  7.34nl 

I  -  22.78nl 

For  H2.  multiply  values  by  ~  /  iOS 
of  the  *  of  samples. 

Mix  H2: 

Mix  H2: 

Mix  H2: 

1.22nl  lOmg/ml  polyA 

3.78|il  lOmg/ml  polyA 

lOmg/ml  polyA 

0.62nl  10%  SDS 

1.92nl  10%  SDS 

10%  SDS 

1.84nl 

1=  5.7mI 

V  SOal 

X  - 

-  Make  up  Mix  HI  and  H2,  mix  each  well 


216 


1  I  i  <1  vrich^mit.edu 


-  Aliquot  Mix  HI  into  0.5ml  tubes,  one  for  each  sample.  Thus,  if  hybing  3  arrays  per  sample, 
make  hyb  mix  for  all  three  slides  in  same  tube. 

-  Add  CyS-DNAs,  if  relevant,  and  add  Cy3-DNAs.  Mix  well. 

-  Add  H2  into  each  tube,  and  mix  thoroughly  by  pipetting. 

-  Heat  at  1  00  deg  C  for  2”  if  30|.il,  4”  if  93^1 

-  Spin  max  speed  1  ’ 

-  Load  samples  onto  arrays,  quickly,  and  load  arrays  into  pre-heated  hyb  chamber  (with  water  in 
base). 

Note:  If  doing  many  hybridizations  at  once,  I  will  heat,  spin  and  load  tubes  1  chamber  at 
a  time,  so  9  or  1  0  slides  at  a  time.  You  don’t  want  your  DNAs  to  cool  off  too  much  between 
when  you  heat  them  and  when  you  load  them  on  the  array  and  get  them  into  the  warm 
chamber.  So  how  many  you  do  at  once  partly  depends  on  how  fast  your  technique  is. 

-  Hyb  arrays  overnight,  >»  1  2  hours. 

Washing  Arrays: 

Prepare  in  bowls: 

Wash  Solution  I 

18ml  20X  SSC 
1.8ml  10%SDS 
580.2ml  MilliQ  H20 

-  Remove  1  hyb  chamber  at  a  time  from  hyb  oven.  Quickly,  transfer  slides  from  hyb  chamber  to 
a  slide  rack  submerged  in  Wash  Solution  I. 

For  doing  many  slides  at  once,  I  have  two  bowls  of  Wash  SItn  I  set  up,  and  use  the  first 
for  gently  swooshing  off  the  coverslip  and  have  the  slide  rack  in  the  second  (gentle  coverslip 
removal  can  be  tricky  with  the  slide  rack  in  the  same  bowl).  To  remove  the  coverslip,  I  hold  the 
slide  horizontally  and  submerge  it  into  the  solution,  moving  It  down  while  tilting  it  forwards 
and  moving  it  back,  all  at  once  with  a  swoop  of  the  wrist.  This  allows  the  coverslip  to  float  off 
cleanly  with  minimal  chance  of  it  scratching  or  touching  the  array  as  it’s  coming  off.  In  theory. 
Experiment  and  find  your  own  best  way  to  do  it  -  sometimes  the  PLL  coating  can  be  very 
delicate  and  you  really  want  to  be  as  gentle  as  possible. 

Rinse  slides  In  Wash  SItn  I  vigorously  for  30  sec  by  plunging  slide  rack  up  and  down. 

I  use  a  plastic  tub  around  the  bowl  for  this  because  1  always  splash  a  lot. 

-  Transfer  the  slide  rack  to  Wash  SItn  II,  blotting  base  of  slide  rack  on  kimwipes  to  remove 
excess  SDS. 

-  Rinse  slides  in  Wash  SItn  II  vigorously  for  30  sec 

-  Cover  the  bowl  with  foil  so  It  Is  dark,  and  transfer  bowl  to  rotator.  Rotate  max  speed  allowable 
(keeps  slides  covered  still  and  doesn’t  splash)  for  5". 

Note  If  I’m  doing  a  lot  of  slides  at  once,  certainly  If  doing  half  the  rack  or  more,  I 
transfer  the  slides  to  a  clean  bowl  of  Wash  SItn  II  after  2.5”. 

-  Quickly  transfer  the  slide  rack  to  a  plate-rotor  In  the  centrifuge,  into  a  rotor-cup  lined  with 
large  kimwipes.  You  may  blot  the  slide  rack  on  other  kimwipes  as  you  transfer  it.  Make  sure 
you  have  set  up  balance  slides  during  the  previous  step  -  you  want  the  slides  to  spin  ASAP  once 
they  are  out  of  the  liquid.  Spin  the  slides  90  x  g  for  5”,  or  until  dry.  Spin  the  slides  with  the 
array  face  facing  Into  the  direction  of  spin. 

Note:  Make  sure  the  centrifuge  is  very  clean  before  you  do  this.  Dust  is  your  enemy  -  it  auto  fluoresces. 


Wash  Solution  II 

1 .8ml  20X  SSC 
598.2ml  MilllQ  H20 
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Extracting  data  from  array  tiff  files  in  GenePix  6.1 

Note;  some  of  these  features  are  not  available  in  CenePix  6.0. 

Presumes  some  prior  experience  with  CenePix. 


1 .  Make  a  master  Settings  file: 

a.  Open  first  array  image,  (=  a  tiff  file). 

b.  Load  the  appropriate  array  list  for  this  array,  (=  a  gal  file). 

c.  In  block  mode,  select  all  blocks  and  align  overall  array. 

d.  Double-click  on  any  block,  and  adjust  spot  diameter  (with  “apply  to  all” 

checked  off)  so  that  it  is  a  little  smaller  than  the  average  spot  size 
on  your  actual  arrays  (in  my  case,  1  50um). 

e.  Align  each  block  individually,  more  precisely. 

f.  Adjust  the  auto-alignment  specs: 

Set  Options  Box  (Alt  -i-  I),  Alignment  Tab  as  follows: 

Click  on  “Find  circular  features” 

Click  on  “Resize  features  during  alignment” 
between  e.g.  70%  and  150% 

Click  on  “Limit  feature  movement  during  alignment"  to  e.g.  40um 
Toggle  for  unfound  features  to  be  “Unflagged” 

(No  CPI  threshold  -  default) 

(Check  Align  Blocks,  estimate  warping  and  rotation  -  default) 
(Automatic  Image  Reg  Max  translation  1  0  -  default) 

(No  sub-pixel  reg.  allowed  -  default) 

g.  Then  save  all  of  this  by  going  to  Save  Settings,  which  will  create  a  gps 

file.  This  will  be  your  master  Settings  file  to  use  on  the  other  arrays 
from  the  same  print  &  hyb  date. 

2.  Auto-aliqn  the  array  grids  for  each  array  scan: 

a.  Click  on  Batch  Analysis  tab  in  the  main  CenePix  window. 

b.  Click  “Add"  and  select  your  tiff  files  to  process 

c.  Select  all  tiff  files  and  click  Add  gps  file,  and  choose  the  master  gps  file 

for  this  set  (the  file  you  just  created  above). 

d.  Uncheck  “Analyze”,  leaving  “Align”  checked  only. 
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e.  Click  on  “Configure  Alignment"  box,  and  within  it  check  "Find  Array”  and 

“Align  Features”  ONLY  -  UNcheck  “Find  Blocks”.  (This  is  because  on 
our  high-density  array,  several  of  the  blocks  are  very  close  together 
and  so  block-finding  gets  confused.  With  the  master  gps  file 
tailored  to  each  array  printrun,  block-finding  isn’t  needed  anyway 
for  good  gridding  if  the  other  two  alignment  types  are  used.) 

f.  Click  "Co",  with  “All  at  once”  checked.  Depending  on  how  many  files  you 

have,  this  can  take  several  hours. 

3.  Check  the  new  alignments. 

a.  Use  the  results  browser  window  (this  will  come  up  automatically  during 

batch  the  alignment.  If  you  close  it  accidentally,  you  can  reopen  it 
by  clicking  within  the  Batch  Analysis  tab,  click  on  the  lower  right- 
hand  array-like  icon),  to  check  the  new  gps  alignment  files  it 
created  for  each  tiff.  Clicking  on  a  gps  file  within  the  browser  box 
will  take  you  to  the  Image  tab  of  the  main  CenePix  window,  and  will 
load  the  tiff  and  its  associated  newly-created  gps  file. 

b.  Manually  inspect  EACH  gps  and  tiff  pair:  manually  adjust  stray  features, 

and  flag  areas  of  surface  PLL  peeling,  or  excessive  background,  as 
“Bad",  to  be  discounted  from  further  analysis. 

c.  Save  each  gps  file  using  the  same  name  as  before  -  replace  the  previous 

version. 

4.  Extract  the  data  (you  can  do  this  immediately  or  at  a  later  time): 

a.  In  the  Batch  Analysis  tab,  delete  all  files. 

b.  Click  on  “Add”,  then  select  all  tiff  files  AND  their  associated  gps  files  at 

once  and  click  OK  -  this  will  link  each  file  correctly. 

c.  Check  “Analysis"  and  uncheck  “Alignment". 

d.  Click  “Go”  and  “All  at  once”.  Again,  this  may  take  a  while  depending  on 

how  many  files  you’re  doing. 

e.  You  may  wish  to  also  use  a  flag  feature  query  -  e.g.,  to  automatically 

flag  as  bad  all  spots  in  areas  of  background  peeling.  To  set  this  up, 
you  must  first  go  into  the  Results  Tab  and  Click  on  Flag  Features, 
and  make  a  new  query  to  suit  your  needs.  E.g.,  I  created  a  query 
called  “test”  which  for  most  of  my  arrays  successfully  flagged  many 
of  my  missing  features,  using  the  following  syntax: 

[B532]  <=  1  00  AND  [B635]  <=  1  00 
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H\  \  irjiitiL!  l^ch.  '.'ladiMit;  ^luJcn:  m  ilio  Del  oiij  I  ;il'  vrichf?/ init.cdii 

[  ,In!  modltli.\l  I  2*'  ON 

This  had  the  effect  of  removing  features  along  the  edges  of  the  array 
where  the  PLL  coating  may  have  peeled  away.  Also,  if  there  are 
large  interior  peeled  sections,  it  removed  those.  HOWEVER,  it  was 
still  unable  to  find  smaller  patches  or  scratches,  or  features  on  the 
edge  of  a  patch  that  should  have  been  flagged  because  either  part 
of  the  feature  itself  was  peeled/scratched  or  part  of  its  local 
background  was. 

In  addition,  the  "test"  query  I  create  works  for  most  but  not  all  of  my 
slides  -  if  some  have  unusually  low  background,  then  it  artificially 
flags  my  data  as  “bad”  even  though  the  spots  are  there.  So  you 
must  tailor  this  to  your  particular  slides  based  on  their 
background,  and  also  judge  whether  to  used  it  based  on  the 
homogeneity  of  your  background  among  your  slide  set.  For  me,  it 
actually  wasn't  worth  using  the  auto-filter  query  since  I  was 
manually  flagging  each  gps  file  for  a  variety  of  things  anyway 
before  extracting  the  data. 

Another  way  to  do  it,  computationally  longer  but  perhaps  easier 

depending  on  your  particular  slides,  would  be  to  have  it  autoalign 
and  then  immediatately  analyze,  using  a  “flag  features  as  bad" 
query.  THEN  go  through  each  results  file,  and  you’ll  see  which 
features  have  been  autoflagged  on  the  corresponding  gps  that 
loads.  You  can  do  additional  flagging  at  that  stage,  re-save  over  the 
gps  files,  and  re-run  the  analysis  to  save  over  the  previous  analysis 
files.  Clunky,  but  may  be  worth  it  based  on  your  particular  specs. 

Also,  you  CAN  get  a  lot  more  sophisticated  with  your  queries.  For 
example,  in  order  to  auto-flag  features  (spots)  that  are  partially 
peeled,  or  at  the  edge  of  a  peeled  region,  or  have  some  other 
aberration,  you  might  like  to  is  have  a  query  that  says  e.g.  "if  e.g. 
>20%  of  the  pixels  in  the  feature,  or  background,  are  less  than  e.g. 

1  00,  flag  as  Bad".  I  asked  Sandra  Lew  about  it,  and  she  said  it’s 
something  you  should  be  able  to  do  with  VBScript  in  the  Results 
Tab  under  the  Flag  Features  button.  She  recommends  going  to  the 
GenePix  Help,  where  there  are  chapters  on  scripting  under  the 
Index  Tab,  describing  some  commonly-used  functions. 

5.  So,  that's  it  -  that's  the  full  pipeline  I  used  to  get  my  results. 

Currently  I  extract  all  the  data  the  software  will  give  me,  in  case  I  ever  want  to 
go  back  to  a  parameter  I  don’t  currently  use  but  which  ends  up  being 
important,  but  you  can  decrease  the  columns  of  data  that  you  get  if  you  so 
desire. 
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GenePix  contact  people: 


vrichfa  mil.cdu 

i  '  i‘  i,.' 


The  software  technical  details  guru; 

Sandra  Lew,  Sandra.Lew@moldev.com 

The  woman  who  updated  our  hardware  and  software:  Yvonne  FitzGerald, 
vvonne.fitzqerald@moldev.com 

She  recommends  that  if  we  have  any  further  problems  with  software  crashes 
(which  we  had  for  a  while  after  she  first  installed  6.0  and  6.1  on  our  new 
machine,  before  she  uninstalled  both  and  reinstalled  6.1),  then  we  make  the 
noise  to  get  a  formal  field  engineer  out  here  to  look  at  the  problem,  because 
she  says  we’ve  exhausted  her  knowledge,  so  if  the  reinstall  didn’t  work  then 
someone  else  needs  to  have  a  go.  Since  we  just  bought  a  new  machine  (new 
computer  +  software  package,  in  March  2008)  we  ARE  under  warranty  for  some 
period,  but  I  don’t  know  how  long  -  so  if  the  problems  re-occur  then  it  would 
be  wise  to  get  them  seen  to  asap. 

Our  local  GenePix  sales  rep,  who  has  given  us  a  loaner  computer  when  the  last 
one  broke;  David  Micha,  david.micha@moldev.com 
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Interactive  Excel  Worksheet  for  Calculating  Reagents  Needs  for  Genome  Proxy  Array: 


How  many  samples  wJ[l_you  hybridize^  Fill  In  highlighted  c^s^in  this  worksheet  onl^yi,  calculations  are  then  automatic 
and  propagated  Into  the  next  worksheet,  which  summarizes  the  order  you  should  place. 

20 

How  many  replicate  A/B/C  reactions  will  you  run?  (Three  recommended,  with  pooling  prior  to  labelling.  Can  then  split  prior  to  hybridization.) 

3 

How  many  replicate  arrays  will  you  hybridize? 

3 


AmeitntJ?trjL20. 

Total 

Waaqant/Cooaumable 

ftotk  Cone 

>Mde$ 

unlls 

amount 

Miking  PLL  slides 

Gold  Saal  Micro  Sltdas 

solid 

120 

N/A 

60 

NaOH  pallats 

solid 

240 

9 

120 

5760  2880 

EtOH 

95% 

960 

mis 

480 

0.5ul  oper  Inking  •  2  Inkings  ■  '-l.Oul,  ■  Ipmoles, 

P8S 

IX 

73.8 

mis 

36.9 

for  aach  of  (384*15  wells  -  5760  walls),  par  85  arrays, 

Poly-L'lysine  solution 

0.1%  (w/v) 

126.72 

mis 

63.36 

so  5760  pmolas  par  85  arrays 

co-spot  ollgo 

1  pmol/ul 

8131.76 

pmolas 

4.0659 

nmolas 

"<1.0  ul  of  EACH  per  85  arrays,  so  40pmoles  per  85  arrays 

array  ollgos 

40pmol/ut 

56.47 

pmoles 

0.0282 

nmolas 

Or,  if  losing  ~  0.68ui  par  wall,  then  for  co-spot,  ■ 

Amount  per  30 

Post~Droc€$$ina  of  Stidts 

flides 

2  inkings  *  0.68pmolas  par  wall  *  5760  wells  ■  7834  pmoles 

Amount  per  120  Total  amount 

succinic  anhydride 

solid 

8.643 

9 

17.286 

slides  required; 

l-Mathyl-2-pvrrolld1nona 

526 

mis 

1052 

11059.76  S.S299 

Boric  Add 

solid 

5 

9 

10 

And  for  "raal"  oligos,  2  inkings  *  0.68ul  par  inking  *  40pmoles/i 

ssc 

20X 

35 

mis 

70 

■  54.4  pmoles  per  bed-full 

SDS 

10% 

7 

mis 

14 

76.80  0.04 

EtOH 

95% 

575 

mis 

1150 

Vol  per  one  rxn 

AIS/Q 

Hatobactahum  DNA 

lOng/ul 

1 

uls 

60 

Primar  A 

40pmol/ul 

1 

uls 

60 

uls 

90 

DTT 

O.IM 

0.75 

uls 

45 

8SA 

500ug/ul 

1.5 

uls 

90 

Saquanasa 

13U/UI 

0.6 

uls 

36 

MgClZ 

SOmM 

4-1-4 

uls 

480 

uls 

60 

Primar  8 

100  pmol/ul 

1  1 

uls 

120 

Taq 

5U/ul 

1  +  1 

uls 

120 

Ambion  water 

pure 

43.75  -1-  77-1-  73 

uls 

11625 

"C"  dNTP  mix 

various 

1 

uls 

120 

Vol  per  Jimple 

can  pool  replicate 

Libelling 

A/B/C  reaction* 

NaHC03 

0.5M 

2 

uls 

40 

33ug  in  Sul 

Cy3  dya 

DMSO 

33 

ug 

660 

Ambion  water 

pure 

300-1-  50 

uls 

7000 

TE 

IX 

500  -1-500-1-19 

uls 

20400 

Co- spot  comp/ement  iibiUns  ■ 

co-spot  complement  ollgo 

1  nmol/ul 

1 

pmoi 

60 

NaHC03 

0.5M 

0.02 

uls 

0.33 

33ug  in  Sul 

CyS  dye 

DMSO 

0.28 

ug 

5.5 

Ambion  water 

pura 

2.92 

uls 

7000 

TE 

IX 

8.49 

uls 

20400 

Hybndutitgn 

SSC 

20X 

4.49 

uls 

269.4 

HEPES 

IM,  pH7 

0.62 

uls 

37.2 

Ambion  H20 

pura 

2.24 

uls 

134.4 

TE 

IX 

3.5 

uls 

210 

polyA 

lOmg/ml 

1.22 

uls 

73.2 

SDS 

10% 

0.62 

uls 

37.2 

So,  for  dNTPs  that  mains: 

par  reaction,  assuming  lOOmM  stock  solution; 

Round  A  Round  8  Round  C 

0.045  0.5  0.5 

so,  total: 


uls  1.045 

62,7 


What  about  aa-dUTP? 

23.5  pi  50mM  aa-dUTP  par  lOOul  of  C-dNTP  mix 

so  for  abova 
#  of  samplas, 

naad  14.1  uls  of  SOmM  aa-dUTP 
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So,  final  tally  of  reagents'  volumes  required  that  you  should  order: 


Orh#f 

loay  fo 

(Mtil  vf  P9t  fout  CqsI  ikt  lpX 

ItrMtmt  Cofif.  Anttumt  /tm*  #  Norr-^  un/r  c«>«r  itfrav  Mrny 

ftottf  tor  ffKH  smtron.  ,vtd  po<rt  orofj^stOQ  ttm  fli  t  %<*<»(«  i«  ifornt  m  Thu*.  f#<*  'totai  vohim^K  nKjt/nnr  tr*  wtchtr  at  youtf 

jctua^  M.  UTKe  ffn?v  <«pres«ot  the  Mnount  for  the  <Ktud^  *  ofsMes  yvu're  hybtno  rotttet  chon  fui  the  nert  iatoei  botch  aoe.  whch 
H  how  you  '*W  pmre^f  them 


Poly  1  lysine 
sointiw) 

0. 15G 
{w/vl 

63.36 

nU% 

Skinta  Aldrkh 

P892a  SOOnil 

500  ml 

1222  00 

128.13 

NaOH  pellers 

dry 
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il 
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500  (1 

153  77 

11?  90 

Mk  rose  ope 
Slides 

solid 

60 

si  kin 
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17  518  lOOA 

Gcddseal  hranrt.  CM  No  1010, 

3*  4  1‘,  1mm  thkk,  ran  also  huy 
as  case  c/  25  »  144  for  1639  07 

144  slides 
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113.31 
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dry;  malic 
slucks  (d 

0.  Inmol/nl 
in  iXSSt. 

4  07 

nnK>»« 
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N/A 

HPLC  pwdfied;  l.Oixnol  starting 
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average.  25  Inmoies  yield,  5  ■ 

ACC  K  (  .  C  A  ACI.  ICI  lilA  i  ;(. 
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117.0(1 
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40pmol/id 

0.01 

s  of 
fAOi 

Invitrogen  /  lllimiina 

N/A 

ordered  50oniol  stArhno 
synthesis  scale,  corKenUatlon 
txiriiMlued  to  40pnx)l/ul.  made 

Intn  aliquot  plates  of  lOul  whkh 
were  shipped  dry,  with  the 
remakrdeis  shipped  as  kgoid  for 
ailqixitting  here  5o  eOOpmoles 
per  nllqiiol  plate,  x  1 0  plates 
(conservatively?)  -  40(X)  pmules 
-  4  nmoles 

4  EACH 

142,000.00 

1796  47 

I  Merhyl  7- 
iryrrulidiminr 

MiliydrcMis. 

QU.SSfc 

ins-7 

n«K 

Sruma  AMisch 

M6/67  II 

lOIK)  nils 

1T7  85 

118  78 

siKCMik 

Miliydiiiic 

diy 

17.29 

a 

SiwiMi  Aklrkti 

239690  500 

50  q 

125.60 

18.85 

ftork  arid 

dry 

10 

il 

e.q.  Sknna-AldTlrh 

86768-1X0 

1000  q 

175  20 

10  25 

liOH 

955G 

1610 

c.a.  stockrxMnii 

do  not  make  horn  1 00%l 

3/85  nds 

112.50 

15  38 

PMS 

IX 

36.9 

iiilv 

e.p.  stockriHHii 

mlv 

♦0.4; 

V).27 


i0  22 


iO.  ^8  to.  m  ror  (I  ftOiti  i>M  li>4 


14  94  16  7?  rnr  n  6lt(i<  OM- Ink 

10.11 


10.15 
10  00 
10.09 


H^lohM'rpiiiwn 

(MA 


PIT 

MM 


Printer  B 


NnIK  01 
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{.v  ullV<» 
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(.v5  dvc 

M-dUTP 


•<NIP  shMks 
»C 
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It.  pH  8.U 


<irv 

40  PHivl/pl 
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llU/ul 
bPiiiM 
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5  ll/iil 
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ilrv 
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lOOiiiM 
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pore 
IX 
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(>0  iiK 

74  Ills 
45  Ills 

W  iiU 
16  Ills 
4011 

170  lllv 


1 70  iiK 
40.  J  3  Ills 
bOO  PU 
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5.5  liQ 

14.1  Ids 

hi./  Ills 
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Airc 
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your  lAvorlle.  r.g.  tiir  rlienp  sliiH  Iroin 
c..il.  Shmim,  nr  slnihriMHii  S4797 
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J* 
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Appendix  2 


A  Primer  on  Microarray  Design 
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Appendix:  Primer  on  aspects  of  Microarray  Design 

With  specific  reference  to  techniques  used  in  microbial  ecology  microarrays 
From  proposal  defense  paper,  2004. 

I.  Array  Creation: 

(A)  Type  of  Probe:  Most  microarray  studies  characterize  expression  in  specific 
organisms,  and  the  majority  of  those  arrays  consequently  use  immobilized  cDNAs  as 
probes  ("probes"  are  the  DMAs  spotted  onto  the  array,  while  "targets"  are  the 
labeled  complementary  nucleotides  in  the  sample  being  queried).  Of  the  limited 
research  employing  microarrays  to  examine  diversity  (i.e.,  the  presence/  absence/ 
relative  abundance  of  probe  sequences  in  a  given  environment)  rather  than 
expression,  there  are  in  general  three  types  of  probes  that  have  been  used:  PCR 
products  (Wu  et  al.,  2001;  Cho  and  Tiedje,  2002),  short  oligonucleotides  (Loy  et  al., 
2002;  Bodrossey  et  al.,  2003),  and  long  oligonucleotides  (Taroncher-Oldenburg  et 
al.,  2003). 

Two  issues  to  consider  in  choosing  probe  type  are  specificity  and  sensitivity.  Short 
probes  are  generally  more  specific  but  less  sensitive  than  long  probes.  This  becomes 
intuitive  in  the  context  of  hybridization  kinetics;  a  long  probe  (several  kb)  will  be  less 
specific  to  a  given  target  than  a  short  probe  because  of  increased  cross¬ 
hybridization.  However,  signal  intensity  Increases  linearly  with  probe  length;  Wu  et 
al.  (2001)  tested  PCR  products  of  varying  sizes  and  found  a  linear  increase  in  signal 
up  to  1.4kb,  using  pure  culture  targets.  The  longer  a  probe  is,  up  to  a  point,  the 
more  labeled  target  can  hybridize  to  it,  and  the  greater  the  signal  intensity. 

This  relationship  between  probe  length,  specificity  and  signal  has  been  described 
mathematically  (e.g.,  Greene  and  Voordouw,  2003) 

I(x)  =  k(x)*c(x)*f(x) 

where  the  hybridization  intensity  [I(x)]  for  a  given  spot  equals  the  hybridization 
constant  [k(x)]  of  the  probe  sequence  times  the  amount  of  probe  DNA  [c(x)]  spotted 
on  the  filter  times  the  fractional  amount  (wt/wt)  of  the  target  sequence  [f(x)]  within 
the  community  DNA.  The  hybridization  constant  is  specific  to  a  given  sequence,  as  it 
is  proportional  to  G-C  content  and  length  but  also  depends  upon  the  precise 
sequence  of  bases. 

To  counter  the  confounding  effects  of  cross-hybridization  when  dealing  with  complex 
natural  communities,  it  is  best  to  use  relatively  short  probes.  In  addition,  by 
choosing  probes  of  uniform  length  and  with  approximately  the  same  G-C  content,  we 
can  choose  a  hybridization  temperature  roughly  appropriate  to  the  entire  array;  PCR 
product-based  arrays  can  be  more  complicated  in  their  interpretation  because  of 
their  length  and  sequence  heterogeneities. 

Very  short  probes  (in  the  range  of  18-30mers)  have  their  own  limitations.  They  seem 
to  have  poorer  hybridization  properties  than  slightly  longer  probes  (Hughes  et  al., 
2001);  this  may  be  because  they  are  too  close  to  the  surface  of  the  array  causing 
hybridization  to  be  physically  hindered.  For  this  reason,  some  investigators  have 
inserted  spacers  (Loy  et  al.,  2002;  Bodrossey  et  al.,  2003),  while  others  have 
increased  the  length  of  the  oligonucleotide,  to  provide  a  spacer  region  which  can  also 
be  involved  in  hybridization,  thereby  potentially  increasing  sensitivity  as  well 
(Hughes  et  al.,  2001;  Taroncher-Oldenburg  et  al.,  2003). 

The  maximum  length  of  oligonucleotide  synthesis  with  high  accuracy  is  70nts.  Due  to 
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the  increase  in  sensitivity  and  accessibility  of  these  longer  oligonucleotide  probes 
compared  to  very  short  probes,  and  their  better  specificity  when  compared  to  longer 
PCR  products,  I  will  be  using  70-mers  for  our  array. 

(B)  Printing:  There  are  several  general  options  for  creating  the  microarrays.  Short 
oligonucleotide  microarrays  such  as  those  made  by  Affymetrix  can  be  made  through 
a  photolithographic  process  (like  computer  microchips),  although  recently  they  have 
also  been  synthesized  In  place  using  an  Ink-jet  printer  to  arrange  and  control  the 
chemistry  (Hughes  et  al.,  2001).  PCR-product  microarrays  and  those  made  with 
long-oligonucleotides  are  usually  spotted  with  an  "arrayer"  robot.  For  the  description 
of  the  design  and  construction  of  such  an  arrayer,  please  see  Eisen  and  Brown 
(1999).  The  Delong  lab  will  have  its  own  arrayer  arriving  this  fall  to  the  new  lab  at 
MIT. 

The  probes  are  suspended  in  a  buffer  during  spotting,  and  the  nature  of  this  buffer 
can  effect  both  the  success  of  the  print  run  (clogging  of  the  robot's  arraying  pins, 
etc.)  and  the  morphology  of  the  resulting  probe  spots.  While  the  majority  of 
microarrays  have  been  printed  using  3X  SSC  as  the  printing  buffer,  several  studies 
have  shown  that  50%  DMSO  provides  better  quality  printing.  Spotting  short 
oligonucleotides  attached  to  a  spacer,  Bodrossy  et  al.  (2003)  found  that  50%  DMSO 
provided  lower  standard  deviation  between  replicate  spots,  and  dried  out  more 
slowly  during  the  spotting  protocol,  than  3X  SSC.  Spotting  PCR  products,  Wu  et  al. 
(2001)  found  the  same  difference,  with  50%  DMSO  providing  better  signal  intensity 
and  spot  homogeneity,  and  lower  evaporation  during  printing.  Others  have  used 
betaine  as  the  printing  buffer  to  similarly  decrease  evaporation  during  spotting  (A. 
Gracey,  personal  communication).  The  reason  that  slow  drying  is  desired  during  a 
print-run  is  because  using  a  fully-aqueous  buffer  dries  quickly  and  unevenly,  leading 
to  poor  spot  morphology.  For  the  prototype  array  I've  been  using  3X  SSC,  to  match 
the  lab  whose  arrayer  we've  used,  but  in  the  fail  I  may  experiment  with  different 
printing  buffers. 

(C)  Post-processing:  In  general,  once  a  microarray  Is  printed,  depending  on  the  type 
of  probe  and  the  type  of  slide  used,  It  may  need  to  be  cross-linked,  blocked,  the 
spotted  DNA  may  need  to  be  denatured,  and  then  the  microarray  must  be  dried. 
These  steps  are  neither  Interesting  nor  particularly  controversial,  and  so  I  will  not  go 
Into  any  details.  It  Is  likely  that  I  will  follow  the  protocols  at  microarrays.org, 
although  appropriate  post-processing  will  depend  on  choice  of  printing  buffer. 

II:  Target  Preparation: 

The  target  sequences,  those  complementary  to  the  probe  and  present  in  the 
environmental  mix  being  queried,  can  be  prepared  in  a  variety  of  ways. 
Considerations  include  whether  any  amplification  step  will  occur,  what  type  of 
fluorophore  should  be  used  for  visualization,  and  the  method  of  attaching  the 
fluorophore  to  the  target. 

A  key  problem  In  existing  microarray  research  on  microbial  communities  is  the  high 
limit  of  detection.  Several  groups,  using  several  different  probe  types  and  target  DNA 
preparation  methods,  have  found  that  to  be  detected  by  its  probe  a  target  must  be 
present  at  >  lOpg  of  DNA  -  assuming  a  genome  size  of  around  5Mbp,  with  a  gene  of 
around  lOOObp,  this  means  a  species  must  represent  >5%  of  the  DNA  in  the 
community  for  its  genes  to  be  detected  (e.g.,  Taroncher-Oldenberg  et  al.,  2003; 
Bodrossey  et  al.,  2003;  Cho  and  Tiedje,  2002).  Bodrossey  etal.  (2003)  used  a  short 
oligonucleotide  array,  and  amplified  the  gene  of  interest  from  their  target  DNA  pool. 
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In  contrast,  Cho  and  Tiedje  (2002)  used  longer  (500-900bp)  PCR  products  as  probes 
but  did  not  amplify  their  extracted  community  DNA.  Taroncher-Oldenburg  et  al. 
(2003)  found  the  same  detection  limit  of  lOpg  of  target  DNA  using  long 
oligonucleotide  probes  and  PCR  amplifying  target  DNA.  While  this  relatively  high 
detection  limit  leaves  microarrays  useful  for  mapping  the  distribution  of  dominant 
species  in  a  system,  we  know  that  numerically  rare  species  can  play  important  roles 
in  community  dynamics  and  biogeochemical  cycling  (e.g.  nitrogen  fixers).  Cho  and 
Tiedje  (2002)  propose  several  possible  solutions  to  this  issue  of  detection  limitations: 
1)  increase  the  amount  of  probe  immobilized  on  the  array;  2)  enrich  the 
environmental  samples  for  the  genomes  or  genes  of  interest;  and,  3)  achieve  higher 
sensitivity  in  signal  detection.  Solution  (1)  is  dependent  upon  the  probe  spotting 
pins  used  during  printing  of  the  array,  and  will  not  be  discussed  here.  Attempts  at 
solutions  (2)  and  (3)  are  discussed  below. 

(A)  To  amplify  or  not:  In  studies  directed  to  specific  functional  groups,  the  sample  is 
often  enriched  for  the  target  sequences  (solution  2  above).  PCRs  are  commonly 
performed  on  the  community  DNA  using  primers  specific  to  the  gene(s)  of  Interest 
(e.g.,  Bodrossey  et  al.,  2003).  While  this  increases  the  effective  sensitivity,  it  can 
also  skew  the  relative  abundances  of  different  sequences  due  to  differential 
amplification  through  PCR  (a  well-documented  limitation  of  PCR  -  see,  for  example, 
Suzuki  &  Giovannoni  1996).  In  addition,  it  limits  the  possible  targets  to  those 
amplified  by  primers  designed  based  on  sequences  already  in  the  database. 

Another,  broader  approach  is  to  use  random  amplification  of  the  target  DNA.  This 
can  be  a  powerful  way  to  increase  the  effective  sensitivity  by  increasing  the  entire 
pool  of  target  DNA  without  biasing  to  specific  known  sequences.  However,  during  any 
primed  amplification  process  there  will  be  heterogeneity  in  both  the  binding 
efficiency  of  the  primers  and  the  polymerization  efficiency,  depending  on  the  local 
structure  and  sequence  involved.  This  will  create  an  unpredictable  distortion  in  the 
relative  abundances  or  different  possible  amplicons  (Suzuki  &  Giovannoni  1996).  For 
this  reason,  several  techniques  for  amplifying  target  in  a  uniform  way  have  emerged, 
though  it  is  not  clear  yet  which  technique  is  consistently  most  robust  across  studies. 
This  fall  I  will  be  experimenting  with  amplification  of  small  DNA  amounts  to  assess 
which  amplification  method  is  best  for  our  application. 

Due  to  PCR's  Inherent  potential  bias  and  stochasticity,  the  ideal  would  therefore  be 
to  avoid  PCR-based  amplification  of  the  target  altogether.  One  possible  solution  is  to 
collect  more  target  DNA  during  the  sampling  process,  obviating  the  need  for 
amplification.  While  organismal  or  soil-based  studies  are  limited  in  the  quantity  of 
DNA  that  is  practical  to  collect,  aquatic  research  can  employ  filtration  to  greatly 
increase  DNA  yields.  Using  tangential  flow  filtration  the  DeLong  lab  has  previously 
collected  sufficient  water  column  DNA  to  create  BAG  libraries  (Beja  et  al.,  2000)  and 
in  microarray  experiments  the  same  technique  can  be  used.  The  typical  tangential 
flow  procedure  concentrates  500L  of  glass  fibre  pre-filtered  seawater  into  a  final 
resuspension  volume  of  O.SmIs  (Beja  et  al.,  2000),  which  represents  an  1000-fold 
concentration  of  the  community  DNA.  Therefore  because  the  marine  microbial 
habitat  is  amenable  to  concentration  of  cells,  a  good  strategy  may  be  to  avoid 
amplification  entirely.  However,  one  of  the  goals  of  microarray  development  is  to 
allow  sampling  at  small  spatial  and  temporal  scales,  and  from  a  practical  standpoint 
ship  time  limitations  will  mean  that  samples  will  be  collected  during  cruises  that  have 
other  primary  goals.  For  this  reason,  it  will  not  be  practical  to  concentrate  large 
amounts  of  water  during  every  sampling  effort.  Critical  locations  may  be  periodically 
sampled  Intensively  by  collecting  large  numbers  of  cells,  for  extraction  and  labeling 
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without  amplification,  to  validate  the  chosen  amplification  technique,  but  this  cannot 
be  the  standard  collection  method. 

Some  studies  have  successfully  queried  un-amplified  extracted  community  nucleic 
acids,  without  collecting  large  volumes  of  sample.  Small  et  al.  (2001)  used  Ipg  of 
extracted  community  RNA  per  slide  and  could  reproducibly  assign  presence  or 
absence  of  the  targeted  groups,  although  they  were  unable  to  reliably  quantify  their 
targets.  Practically,  without  dramatic  increases  in  sensitivity  through  detection 
abilities,  some  form  of  amplification  is  likely  to  remain  necessary.  To  increase  the 
sensitivity  of  the  detection  of  hybridized  target  (solution  3  in  the  preceding 
discussion  of  detection  limits),  one  can  improve  the  visualization  method  and/or  the 
fluorophore. 

(B)  Choice  of  fluorophore:  The  next  consideration  regarding  the  target  DNA  is  which 
fluorophore  should  be  used.  The  fluorophore  is  the  fluorescent  molecule  that  is 
attached  to  the  target  DNA,  so  that  the  targets  hybridization  to  any  of  the  probe 
spots  on  the  array  can  be  visualized.  Issues  surrounding  fluorophore  choice  include 
the  relative  Intensity  of  the  fluorophore,  its  susceptibility  to  bleaching,  and  to 
quenching.  The  vast  majority  of  microarray  studies  use  the  rhodamine-derivative  Cy 
dyes,  Cy3  and  Cy5.  However,  several  recent  studies  have  suggested  that  Alexa  dyes 
(Molecular  Probes)  may  be  better.  Increasing  sensitivity  by  2-3-fold  (Appendixl  Fig 
1;  and,  e.g.,  DeRisi,  2003).  There  are  seven  different  Alexa  dyes,  one  of  which  can 
already  be  bought  in  the  esterified  form  (see  next  section  for  why  this  is  required). 
The  Alexa  dyes  are  less  effected  by  pH  and  more  resistant  to  photobleaching 
compared  to  the  Cy  dyes  (Fig.  1;  and  DeRisi,  2003).  A  caveat  when  attempting  to 
reproduce  results:  researchers  have  found  that  different  Cy  dye  batches  can  have 
quite  different  levels  of  sensitivity  (Wu  et  al.,  2001). 

(C)  Labeling  the  target  with  the  fluorophore:  Once  the  appropriate  fluorophore  has 
been  chosen,  the  next  step  is  the  labeling  of  the  target  DNA.  There  are  two  types  of 
protocols  for  labeling.  The  first  Is  "direct"  labeling,  where  the  fluorophore  is 
conjugated  directly  to  one  of  the  nucleotides  used  in  a  replication  or  transcription 
step  of  the  target  preparation.  The  second  approach  Is  "Indirect"  labeling,  which 
Incorporates  a  non-labeled  but  modified  nucleotide  in  the  replication  or  transcription, 
which  is  then  conjugated  to  the  fluorophore  after  the  polymerization  reaction.  Direct 
labeling  is  faster  and  simpler,  however  the  incorporation  efficiency  of  the  labeled 
nucleotides  Is  lower  than  for  unlabelled  nucleotides  (DeRisi,  2003).  Indirect  labeling 
avoids  the  problems  of  differential  incorporation  of  the  Cy3-  and  Cy5-labeled  dUTPs, 
gives  a  lower  background  fluorescence,  and  increases  sensitivity  (Dennis  et  al., 

2003;  DeRisi,  2003).  In  indirect  labeling,  amino-allyl  dUTPs  are  used  In  the 
polymerization  step.  The  products  are  then  conjugated  to  an  N-hydroxylsuccinamidyl 
ester  form  of  the  desired  fluorophore.  As  opposed  to  dye-labeled  dNTPs,  the 
incorporation  of  amino-allyl  dUTPs  is  approximately  the  same  as  that  of  unmodified 
dUTPs  (DeRisi,  2003). 

To  prevent  secondary  structure  from  forming  in  the  target  sequence  and  interfering 
with  Its  hybridization  to  probes,  the  labeled  target  Is  often  fragmented.  While  many 
groups  use  labeled  DNA  as  the  target,  RNA  can  be  chemically  fragmented  in  a 
random  manner  (Bodrossy  et  al.,  2003).  In  addition,  for  those  using  a  direct-labeling 
approach,  the  Incorporation  of  Cy-labeled  nucleotides  into  RNA  is  more  efficient  than 
it  is  in  DNA.  For  these  reasons,  in  vitro  transcription  has  been  used  during  target 
preparation  (e.g.,  Bodrossey  et  al.,  2003).  However,  even  chemical  fragmentation  of 
RNA  may  not  always  provide  uniformly  small  pieces  -  longer  transcripts  may  not 
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may  not  fragment  thoroughly;  Koizumi  et  al.  (2002)  had  difficulty  with  some  of  their 
probes  never  showing  signal  even  when  they  knew  there  was  a  perfect  match 
transcript  in  the  target  mix,  and  they  suggested  that  one  factor  responsible  was 
incomplete  fragmentation  of  the  target. 

Ill:  Hybridization: 

As  with  more  traditional  blot-based  hybridization,  hybridization  to  microarrays  is 
affected  by  a  range  of  factors.  The  stringency  of  hybridization  is  affected  both  by  the 
conditions  during  the  hybridization  itself,  and  in  the  subsequent  wash  steps.  The 
temperature  used  for  hybridization  cannot  be  empirically  tailored  to  each  individual 
probe  when  there  are  hundreds  or  thousands  of  probes  on  the  same  substrate, 
although  when  designing  short  probes  of  uniform  length  it  is  possible  to  select  for 
probes  with  a  close-to-uniform  melting  temperature  (within  a  narrow  range). 
Although  some  researchers  do  empirically  hone  their  hybridization  temperature  (  Wu 
et  al.,  2001),  many  use  a  low-  or  even  room-temperature  hybridization  and  rely  on 
wash  steps  to  achieve  the  desired  specificity.  Wash  buffers  typically  include  a 
detergent,  such  as  SDS,  and  ionic  components  such  as  SSC  (which  is  made  from 
sodium  chloride  and  sodium  citrate),  both  of  which  stabilize  the  hydrogen  bonding 
interactions  of  hybridization.  By  decreasing  ionic  or  detergent  concentrations  in  the 
wash  buffer  and  by  increasing  the  temperature  or  duration,  the  wash  stringency  can 
be  increased.  For  this  reason,  several  groups  have  invested  significant  effort 
empirically  determining  the  most  appropriate  wash  conditions  for  their  microarrays 
to  achieve  the  best  trade-off  between  specificity  and  sensitivity  (Wu  et  al.,  2001; 
Koizumi  et  al.,  2002;  Taroncher-Oldenburg  et  al.,  2003). 

The  most  appropriate  hybridization  and  washing  conditions  for  a  given  array  will 
depend  not  only  on  the  specific  probes  and  target  involved,  but  also  on  the  design  of 
the  array  and  on  questions  being  asked.  Has  the  array  been  created  with  some 
redundancy  in  probes,  so  that  there  is  more  than  one  probe  for  a  given  gene  or 
organism  of  interest?  If  there  is  some  redundancy,  then  It  may  be  less  crucial  to 
prevent  closely  related  target  sequences  from  binding  probes,  since  the  degree  of 
similarity  between  two  different  species'  genes  or  genomes  will  vary  with  region  -  as 
was  demonstrated  graphically  in  the  Wang  et  al.  (2002)  virochip  viral  "barcoding" 
results.  In  addition,  the  questions  addressed  may  focus  more  on  family-  and  genus- 
level  changes  in  community  composition,  rather  than  changes  In  single  species.  Even 
within  species,  different  strains  can  have  different  genes  and  different  possible 
niches  (REF);  using  multiple  probes  specific  to  a  given  species  or  strain,  it  is 
becoming  possible  to  use  microarrays  to  examine  microheterogeneity  of  strains  or 
closely  related  species  (dubbed  "genomotyping")  and  to  pick  out  evidence  for  lateral 
gene  transfer  (Murray  et  al.,  2001;  Urakawa  et  al.,  2003).  Thus,  the  appropriate 
degree  of  stringency  will  depend  entirely  on  the  microarray  Involved  and  on  the 
questions  being  asked. 

If  one  were  addressing  questions  that  depended  on  precise,  incontrovertible 
identification  of  perfect  probe-match  in  the  target  mix,  would  it  possible  to  achieve  a 
sophisticated  enough  level  of  resolution  to  resolve  perfect  matches  and  single-base- 
pair  mismatches?  Current  research  Indicated  that  it  is  not  possible  for  every  case. 
While  the  level  of  discrimination  being  created  by  the  hybridization  and  wash 
conditions  should  be  tailored  to  the  questions  being  asked,  it  does  not  appear 
possible  to  ensure  that  all  probes  on  an  array,  or  even  any  given  probe,  only 
produces  signal  from  a  perfect  match.  However,  it  is  possible  to  use  hybridization 
and  wash  conditions  to  effectively  remove  double-  and  greater  mismatches.  With 


230 


optimization,  one  can  also  achieve  exclusion  of  internal  single-base-pair  mismatches. 
However,  for  some  probes  terminal  or  penultimate  single-base-pair  mismatches 
hybridize  as  well  or  even  slightly  better  than  perfect  matches  (for  example,  see 
Urikawa  et  al.,  2003;  Taroncher-Oldenburg,  2003).  Stahl's  group  regularly  collects 
melting  profiles  for  all  the  spots  on  their  microarrays,  in  order  to  analyze  dissociation 
temperatures  of  targets  from  probes.  Recently  they  trained  a  neural  network  to 
inspect  the  signal  data  from  their  microarray  at  a  given  optimal  discrimination 
temperature,  determined  by  the  melting  profiles,  and  make  the  judgment  of  whether 
or  not  a  given  signal  represented  a  perfect-  or  mis-match.  The  R2  for  the  ability  of 
the  neural  network  to  correctly  call  a  perfect  match  from  a  mismatch  based  on  signal 
Intensity  was  only  0.70  (Urakawa  et  al.,  2003).  It  seems  that  this  is  not  a  limitation 
of  the  neural  network  analysis,  but  rather  an  oddity  of  the  hybridization  kinetics  of 
certain  mismatches,  and  therefore  will  likely  not  be  surmountable  by  improved 
analytical  tools.  However,  the  good  news  is  that  by  creating  an  array  with 
redundancy,  one  can  safeguard  against  misinterpretation  based  on  a  single  faulty 
data  point.  For  studies  specifically  of  microheterogeneity,  this  internal-single-base- 
pair  mismatch  discrimination  represents  the  current  limit  of  discrimination. 

Hybridization  of  microarrays  is  still  poorly  understood.  In  Loy  et  al.'s  (2001) 
microarray  studying  sulfate-reducing  bacteria  they  saw  up  to  a  dramatic  56-fold 
difference  in  the  signal  intensity  of  perfect  matches  among  their  136  probes.  They 
suggest  that  this  difference  may  be  due  either  to  secondary  structure  in  the  labeled 
target  DNA  or  to  steric  hindrance  from  hybrids  formed  on  the  array  during  the 
hybridization  process.  Fora  detailed  discussion  of  the  hybridization  behavior  of 
oligonucleotides  in  the  context  of  microarrays,  see  Bodrossey  at  al  (2003). 

When  using  microarrays  to  track  gene  expression,  investigators  are  looking  for 
differences  in  signal  among  different  stages  or  cell  types,  and  so  will  competitively 
hybridize  target  from  both  or  from  a  range  or  conditions  in  relation  to  one  "standard" 
condition.  With  non-expression  microarray  studies,  competitive  hybridization  has 
continued  to  be  used,  because  absolute  quantification  through  hybridization  is  not 
possible  and  so  some  form  of  relative  quantification  must  be  used.  With  longer  PCR 
products  as  probes,  Cho  and  Tledje  (2002)  used  lambda  DNA  in  their  microarray 
design.  By  spotting  equal  amounts  of  lambda  DNA  to  probe  in  each  spot  on  the 
array,  they  could  then  spike  their  target  with  lambda  DNA,  labeled  with  the  other 
fluorophore.  This  provided  an  internal  standard  to  quantify  each  spot's  signal  in 
relation  to,  and  also  allowed  for  normalization  across  the  array  for  differences  in 
spotting  or  hybridization  efficiency,  as  well  as  representing  a  positive  control.  With 
longer  PCR  products  as  probes  this  is  a  smart  approach,  because  the  hybridization 
kinetics  of  the  probe  and  the  lambda  DNA  will  be  reason  ably  similar  when  averaged 
over  their  entire  length,  allowing  the  lambda  DNA  to  act  as  a  standard  for  whatever 
probe  is  being  used.  However,  with  oligonucleotide  probes,  the  hybridization  kinetics 
can  be  markedly  different  depending  on  the  precise  sequence.  So  unfortunately, 
lambda  DNA  would  not  be  a  meaningful  internal  standard  on  an  oligonucleotide  array 
(Bodrossey  et  al.,  2003). 

An  Interesting  but  very  labor-intensive  approach  used  by  Bodrossey  et  al.  (2003)  in 
their  oligonucleotide  array  was  to  use  as  a  reference  an  artificial  mixture  of  the  PCR 
products  represented  by  their  short  oligonucleotide  probes.  They  would  first  do  a 
one-color  hybridization  of  the  community  DNA  of  interest  to  their  array  to  get  an 
idea  of  which  sequences  were  present,  and  their  rough  relative  abundances.  Then 
they  would  then  make  up  an  appropriate  reference  mix  of  those  sequences  present 
using  the  appropriate  PCR  products,  in  a  known  ratio,  and  use  that  to  competitively 
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hybridize  against  the  target  community  DNA  to  refine  their  quantification  of  relative 
abundances.  While  this  seems  tractable  in  a  smaller  microarray  (they  had  only  59 
probes)  one  can  imagine  it  becoming  unwieldy  quickly  as  the  number  of  probes  on 
the  microarray  increased.  Bodrossey  et  al.  (2003)  acknowledged  the  limitations  of 
this  reference  DNA  approach,  since  it  requires  a  reference  set  as  similar  as  possible 
to  the  sequences  present  in  the  target.  This  limits  the  utility  of  this  approach  to,  for 
example,  studies  looking  at  a  single  community  over  time  or  under  different 
conditions. 

However,  for  our  BAC-derived  microarray,  there  may  be  a  way  to  competitively 
hybridize  a  reference  set  for  less  precise  quantitative  purposes.  By  amplifying, 
labeling,  and  fragmenting  the  BAC  inserts  used  for  the  creation  of  the  array,  it 
should  be  possible  to  develop  a  standard  reference  mix  to  be  used  identically  in  all 
hybridization  reactions.  It  has  been  previously  suggested  that  the  ideal  reference  for 
a  complicated  sample  is  equal  portions  of  each  component  mixed  together  (Eisen 
and  Brown,  1999).  This  would  provide  relative  quantification  for  all  samples  targeted, 
in  relation  to  this  reference  mix,  and  would  us  Indirectly  compare  multiple  water 
samples  taken  at  differing  times.  For  comparisons  with  other  labs  in  the  long  run,  it 
would  be  ideal  to  develop  a  more  precise  means  of  standardization.  (To  be  clear, 
hybridization  depends  not  only  on  the  target's  absolute  quantity  but  also  on  Its 
relative  abundance  in  the  total  target  DNA,  which  might  be  quite  different  than  that 
in  the  reference  mix.  This  is  why  creating  a  reference  mix  with  the  same  components 
present  and  In  the  same  relative  proportions  as  the  target  is  the  best  way  to  actually 
get  the  most  precise  quantification  -  but  this  Ideal  reference  mix  will  change  over 
time,  with  the  community,  and  so  is  not  a  practical  solution  for  ecological  studies.) 

V:  Data  Analysis: 

this  section  is  just  a  brief  introduction  to  a  few  of  the  considerations 
surrounding  microarray  data  analysis,  and  will  be  expanded  in  the  future. 

Several  studies  have  compared  the  results  of  microarray  analyses  to  standard 
methods  of  assessing  microbial  diversity  and  relative  abundances,  such  as  PCR- 
DGGE  (Koizumi  et  al..,  2002),  sequencing  of  PCR  product  clone  libraries  (Loy  et  al., 
2002),  and  Northern  blots  (Koizumi  et  al.,  2002).  In  general,  these  studies  have 
found  a  good  agreement  among  techniques,  with  the  caveat  that  microarrays  have  a 
comparatively  high  detection-limit  (see  start  of  Materials  and  Methods). 

An  important  consideration  is  how  to  decide  whether  to  count  a  probe's  signal  as 
"on"  or  not,  or  in  our  case  "present"  or  not.  Many  different  approaches  to  microarray 
data  standardization  have  been  explored  (e.g.,  Dennis  et  al.,  2003).  Many 
researchers  use  an  arbitrary  cut-off  for  defining  a  signal  as  "on",  for  example  in  Loy 
et  al.  (2002)  the  cut-off  for  considering  a  given  spot  "positive"  was  if  its  signal-to- 
noise  ratio,  calculated  using  their  unique  formula,  was  greater  than  2.0. 

Once  the  number  of  spots  exceeding  the  defined  cut-off  signal  intensity  has  been 
determined,  the  next  step  is  to  interpret  the  remaining  data.  To  date,  several 
approaches  exist  to  ordering  the  data  to  interpret  meaning,  look  for  patterns,  and 
assess  the  significance  of  differences  seen  among  treatments. 

Hierarchical  clustering  is  a  common  tool  for  looking  at  microarray  data  and  can 
reveal  informative  patterns  (Brown  and  Botstein,  1999).  For  example,  in  the  results 
of  an  experiment  using  the  BAC-based  microarray  proposed  here,  all  the  probes  from 
a  given  BAC  might  cluster  together,  implying  the  presence  of  that  genome  or  its 
close  relatives  in  the  sample.  Alternatively,  the  probes  to  several  homologues  of  a 
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given  operon  from  several  different  BACs  might  cluster  together  -  this  could  be 
Interpreted  in  two  ways:  either  the  operon  has  a  very  high  degree  of  conservation,  to 
the  exclusion  of  the  rest  of  the  host  genomes,  or  there  are  novel  genomes  present 
which  contain  a  highly  conserved  version  of  that  operon.  This  example  shows  how 
important  probe  design  can  be  -  In  highly  conserved  genes,  it  may  be  appropriate  to 
include  two  different  probes,  one  in  the  heart  of  the  conserved  region,  and  one  in  the 
most  divergent  region.  While  hierarchical  clustering  Is  an  extremely  useful  tool, 
many  have  reservations  about  it  because  of  the  sheer  volume  of  data  involved  in 
microarrays  studies.  Clustering  can  be  unreliable  when  dealing  with  so  much  data 
because  It  is  often  impossible  to  achieve  high  bootstrap  values  (Tilstone,  2003;  and 
K.  Pollard,  personal  communication). 

A  true  statistical  analysis  of  microarrays  is  a  difficulty  problem,  and  a  single  solution 
has  not  been  embraced  in  the  community.  An  add-in  to  Excel  has  been  developed 
called  SAM,  the  Significance  Analysis  of  Microarrays,  which  Is  superior  to  a  t-test  at 
the  low  replication  numbers  typical  of  microarray  studies  (Piper  et  al.,  2002).  Several 
groups  are  working  on  robust  tools  for  statistically  analyzing  microarray  data, 
including  Duke  University's  CAMDA,  the  Critical  Assessment  of  Microarray  Data 
Analysis  (www.camda.duke.edu). 
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Appendix  Figure  1,  Comparison  of  Alexa  and  Cy  dyes  for  labeling  targets  in 
microarray  hybridizations  (figures  taken  from  Molecular  Probes'  website) 

A.  They  have  similar  absorption  and  emission  spectra 


B.  But  look  at  the  emission  intensity 


C.  Alexas  also  have  less  bleaching  over  time,  less  sensitivity  to  pH  change,  and  less 
quenching  as  the  number  of  dye  molecules  per  target  increases 
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We  investigated  the  diversity  of  methane-oxidizing  bacteria  (i.e,,  methanotrophs)  in  an  annual  upland  grass¬ 
land  In  northern  California,  using  companitive  se(|nence  analysis  of  the  pmoA  gene.  In  addition  to  identlly  ing 
type  II  methanotrophs  commonly  found  in  soils,  we  discovered  three  nnvel  pmoA  lineages  for  which  no  culti¬ 
vated  members  have  been  previously  reported.  These  novel  pmoA  chides  clustered  together  either  with  clone 
sequences  related  to  “R.\  14”  or  “WB5FH-\,”  which  both  represent  clusters  of  environmentally  retrieved 
sequences  of  putative  atmospheric  methane  oxidizers.  Conservation  of  amino  acid  residues  and  rates  of  non- 
synonyinotis  versus  synonymous  nucleotide  substitution  In  these  novel  lineages  suggests  that  the  pmoA  genes 
In  these  ciades  code  fnr  fimctlnnally  active  methane  monooxygenases.  The  novel  ciades  responded  to  simulated 
global  changes  differently  than  the  type  II  methanotrophs.  W'e  observed  that  the  relative  ubundance  of  type  II 
methanotrophs  declined  In  response  to  Increased  precipitation  and  Increased  atmospheric  temperature,  with 
a  significant  antagonistic  interactinn  befiveen  these  factors  such  that  the  effect  nf  bnth  together  was  less  than 
that  expected  Irnm  their  Individual  effects,  Two  of  the  novel  ciades  were  not  observed  to  respond  ,signilicantly 
to  these  envirnnmentai  changes,  while  one  of  the  novel  ciades  had  an  opposite  response.  Increasing  In  relative 
ubundance  in  response  to  increased  precipitation  and  atmospheric  temperature,  with  u  significant  antagonistic 
intcnictlon  between  these  factors. 


.Methane -oxidizing  bacteria  (methanotrophs)  are  a  unique 
group  of  aerobic,  grani-tiegalive  bacteria  that  use  nielhane  as 
their  sole  source  of  energy.  They  arc  ubiquitous  in  nature  and. 
as  the  major  biological  sink  for  the  greenhouse  gas  methane, 
they  are  involved  in  the  mitigation  nf  global  warming.  Methan¬ 
otrophs  are  also  of  special  interest  to  environmental  microbi¬ 
ologists  because  of  their  capability  to  degrade  various  environ¬ 
mental  contaminants,  their  potential  for  single  cell  protein 
production,  and  other  novel  aspects  of  their  biochemistry  (19). 

Based  on  physiological  and  hitKhemical  characteristics,  cul¬ 
tured  members  of  the  methanotrophs  are  traditionally  divided 
into  two  main  groups:  type  i  methanotrophs,  which  are  mem¬ 
bers  of  the  class  Canimapirneohacteriu  (e,g.,  Metliylonionas, 
Methyiococciis^  Mcthyhmicrohium,  Methyhthenmis^  Methyloha 
lohiuni,  MethyliKalduw.  and  Methylohacter)  and  type  II  me- 
thanidrophs.  whicli  are  in  the  class  AlphaproteohactcrUi  (e.g.. 
Meihvlosinus^  MethyhKella^  Methvlocapsa,  and  Methyl<Kystis) 
(14,15.19). 

However,  this  picture  of  methanotrophic  diversity  has  be¬ 
come  much  more  complex  recently.  The  genera  Methylocella 
and  Methylocapsa,  although  considered  members  of  the  type  ii 
methanotrophs.  are  phylogenetieally  distinct  from  the  classical 
representatives  of  type  ii  methanotrophs  and  differ  physiolog¬ 
ically  In  many  aspects  from  all  other  known  methanotrophs 
(13-16),  In  addition,  methanotrophic  isolates  from  some  Arc- 
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tic  soils  have  been  shown  to  possess  highly  divergent  pmoA 
genes;  this  gene  ciietKles  the  active  site  polypeptide  of  partic¬ 
ulate  methane  monooxygen asc,  a  key  enzyme  in  methane  ox¬ 
idation  (39).  The  pmoA  gene  has  been  used  as  a  molecular 
marker  in  numerous  environmental  studies  of  melhanolroph 
diversify  ( IS.  20.  2S.  36)  and  is  an  ideal  marker  because  it  codes 
for  an  enzyme  that  is  central  lo  methane  oxidation,  is  present 
in  all  known  methanotrophs  (with  the  exception  of  Methylo 
celUi),  and  there  is  no  evidence  of  horizontal  transfer  of  pmoA 
among  methanotrophs  (i.c,.  the  pmoA  phylogeny  is  generally 
consistent  with  the  16S  rRNA-based  phylogeny  of  methano- 
trtiplis)  (12,  36).  Unique  pmoA  gene  sequences  (for  which  no 
Isolates  arc  known)  have  also  lx:cn  identified  in  a  number  of 
cullurc-indcpcndcnt  studies  of  environmental  samples  (4.  21, 
25.  2S,  ,12).  Among  the  most  interesting  of  these  unique  se¬ 
quences  are  those  suggested  to  belong  to  specialized  methan¬ 
otrophs  adapted  to  the  trace  levels  of  nietliuiie  found  in  tlic 
atmosphere  (25.  32). 

For  example,  in  forest  stnls  that  arc  sinks  for  atmospheric 
methane,  novel  pmoA  sequence  types  (the  clade  containing 
type  .sequence  ‘RA  14**)  distantly  related  to  MethyhKupsu  aci- 
diphila  have  lx;eii  deserilx;d  frequently,  providing  evidence  for 
the  existence  of  a  distinct  group  of  “specialized*'  methano 
trophs  (ill.  2i.  25).  It  has  also  been  suggested  that  another 
group  of  methanotrophs  represented  by  a  novel  pmoA  lineage 
(the  clade  containing  type  sequence  **WB5FH-A**)  that  groups 
distantly  lo  lyjx;  I  metluiiiotroplis  might  be  involved  in  atmo¬ 
spheric  methane  consumption  In  some  soils  as  well  (32),  None 
of  these  putative  atmospheric  methane  consumers  has  yet  been 
isolated. 

Consumption  of  atmospheric  methane  has  the  potential  lo 
play  an  impe^rtant  rule  in  climate  change.  Metlume  is  20  lo  25 
times  more  effective  per  molecule  than  CO,  as  a  greenhouse 
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Mi  l iiam:  o\idi/i\c;  hac'H  kia  and  cji  orai  (  uanc;! 


g:is  (5.  44).  C'onsumptiun  of  iiimosphcric  methane  is  eslimaied 
lo  account  for  iiboul  h'"/  (about  30  Tg/Venr)  of  the  global  at¬ 
mospheric  methane  sink  (41 ).  Furthermore,  the  environmental 
changes  associaled  with  greenhouse  scenarios  (e.g..  iiicreaseil 
temperature,  pieeipitation,  and  nttrogeii  deposition)  have  the 
potential  to  interact  with  methane  consumption  and  cause 
posiifve  feedbacks  between  methane  lUi\  and  cltmale  ch.mge 
(31.  .S| ).  These  interactions  have  been  attributed  to  changes  in 
ihe  activity  of  methanotrophs  and/or  alterations  in  the  striie> 
lure  of  the  methanotroph  community  in  response  to  these 
environmental  changes  (31)  However,  it  is  unknown  whether 
realistic  global  changes  have  the  potential  to  alter  the  sti  ueture 
of  the  methanol roph  community. 

We  investigated  Ihe  response  of  soil  methanotrophs  to  sim¬ 
ulated  mullifactortal  global  change,  including  elevaied  atmo¬ 
spheric  CO_..  higher  atmosplicric  temperatures,  increased  pre¬ 
cipitation.  and  increased  nitrogen  deposition,  manipulated  on 
the  eeosyslem  level  in  a  Californian  annual  grassland.  The  aim 
i>f  our  siudy  was  r\voft)ld.  The  first  goal  was  to  assess  the  mc- 
ihanotrophie  diversity  t>f  the  Californian  annual  grassland.  I  his 
was  accomplished  bv  amplifying,  ek>ning.  and  sequencing  pmo.,4 
genes.  Our  sectmd  gi)al  was  lo  monitor  shifts  in  methanolroph 
community  compositton  in  response  to  simulated  global  change. 
This  was  acciimplishctl  by  creating  genetic  community  profiles 
of  metham>tn)phs  from  soils  exposed  to  dillercnt  combinations 
of  simulated  global  changes.  Hiese  pnsfiles  were  based  on  ter¬ 
minal  restrieiion  Iragmeni  length  polymorphism  (T-RFLP)  anal¬ 
yses  o\  pmoA  genes,  an  automated  and  sensitive  approach  that 
has  been  used  for  the  characterization  of  methanotrophs  in 
varit>us  environments  (23.  27,  2S,  4il). 

We  observed  that  our  grassland  stiil  harbored  a  remarkable 
diversity  of  knt>wn  and  novel  pmoA  gene  lypes  and  that  Ihe 
community  structure  of  methanotrophs  in  this  soil  changed  hi 
respirnse  to  simulated  global  change. 

MATtRLXLS  .\ND  MElItDDS 

I'irM  e\p«riin<rnl.  The  iinp.al  of  iiidivid(i;il  .md  nnillipic.  siinulliincoiis  ulohiil 
eh;inges  on  mcth;ino(roph  comniiinity  compisiiion  mjs  mvesligtited  using  (he 
liispcr  Ridgo  Olohal  (  hnnge  Fx|x;rimcnt  (JR(lt<H.  The  JRGC'F.  is  Kscaicd  on 
Ihe  J.nspcr  Ridge  Bioh'igical  Preserve,  which  lies  in  the  c.isteni  foothills  of  rhe 
S.iiita  Cruz  Mountains  in  northern  California.  The  dinialc.  vcgclalion.  and  sen! 
piinimclcrs.  j.s  well  a.s  the  experimental  de.sign.  have  heen  described  in  dct.iil 
prcvioast\*(4.1.  48).  In  hnet.  the  JRtfCE  wa.s  cstnhitshed  in  a  grassland  ecosystem 
doniUKitcd  hy  aiutiial  griisscs  C’t'oni  burNuu  aud  linmius  httninntta)  and  forbs 
((/cniituirii  Juurfiim  .ind  Enkhuni  /v>/n-r).  growing  on  a  sandsinne-denved  viil 
with  ai>  .iverage  pH  of  h..ll  "  lt.1.  Four  global  ch.inge  factors.  CG,  (aoihicnt  iind 
bHtt  ppm),  tempcraiiirc  (ninhicnt  and  anihicnt  plus  80  W  nt  *  of  thermal  radi¬ 
ation).  prccipilation  (ambient  and  atxrvc  ambient),  and  nitrogen  deposition 
(-jmhient  and  amhient  plu.s  7  g  N  m“'  in  the  form  of  calcium  nitrate),  were 
applied  to  ditferent  plots  m  a  lull  l.ictorini  design  (leiidina  to  a  total  ot  t6 
dilTcreni  ircatnicnls).  EiK'h  Ircithiient  was  rcplkated  eiglil  liiuev  The  irealineiits 
were  applied  as  a  split  pi ol  design  with  ^2  circular  pints,  each  divided  into  four 
0,78  m-  quadrants,  sep.ir.ited  by  solid  belowground  and  mesh  abovegnmnd 
partitions.  Infrared  hc.it  lamps  were  suspended  over  tlic  centers  of  the  warming 
plols.  heating  Ihc  planls  in  all  quadrants  of  a  plot  by  0.8  lo  PC.  Atmospheric  CO> 
eoTK'ontraiions  were  elevated  with  a  ring  of  frec-air  emitters  surrounding  the 
plots.  Anihicnt  prccipitatmn  events  were  augmented  with  dnp  irrigation  and 
overhead  sprinkJersi  the  precipitation  trciitniciit  iiKtcnscd  the  average  soil  imiis- 
turc  from  IV  to  26.6'  r  (mea.sured  at  Ihe  time  of  soil  sampling).  Warming  and 
CO^  treatmenl.s  were  .ipplied  on  the  whole-plot  level,  and  preeipiliition  and 
niiiogcn  treatments  were  applied  on  the  suhplor  level.  Manipulations  started  ii> 
lilt  aulitniii  of  1W8.  al  Ihc  beginning  of  cuaslat  Californins  rainy  season. 

Soil  sampling.  The  analysis  of  mierohial  communities  was  initiated  in  May 
2tKKI.  S«*il  enres  Irom  all  replicate  irenlmcnis  were  tnken  hrom  a  depth  of  I.>  cn> 


with  a  2.2-cm-di,imctcr  corcr.  P.ich  core  was  pl.iccd  in  .i  plastic  hag,  cxKilcd  on  ice 
in  the  hcUl.  .mil  honiogenizcil  thoroughly  hy  hand  m  the  lahoratorv  prior  to 
storing  at  Hit'C 

ExtriKiion  Ilf  lotiit  t)N.\.  ExiMctinn  of  DNA  from  t).5  g  nt  -mil  was  performed 
using  the  tllti.i  soil  DNA  cxtiaction  kit  (MiiBio  I  .then. -noi it's.  Nsbn.i  Boaih. 
CA)  .ictording  to  the  m.inufatlurer’s  insiructions,  with  (he  exception  that  the 
final  piitilication  step  was  repeated  to  incre.ise  the  purity  of  the  DNA  The  DN.A 
was  resuspended  in  .n  tin.il  voltinic  of  5(t  pi  .md  stored  .it  Stri .  DN,A  qiiniui- 
fication  was  pcrhirmed  with  the  Pici>Cireen  a.ss.iv  (Molecular  Prohes.  l-ugene. 
OR  I  according  lo  the  maiuifactnrerS  directions  The  DNA  vicid  was  .ippiovi- 
mately  .>  to  211  ng  pi. 

PriiiRT  estiliiiition.  To  rh.iracienzo  the  metli.inotropliic  divorsitv.  we  tested 
live  different  primer  combi nafums  for  their  suilabilirv  to  amplify  pmo.A  gene 
types  ill  Ihe  J.isper  Ridge  grasshind  viils,  For  this  prelmun.iry  lesL  W'e  chose  vul 
s,iinples  from  two  plots  with  cicv.ntcd  CO>.  temperature,  precipitation,  .ind 
nitrogen  (plots  tD.^  and  ID(>U).  l-or  each  soil  and  primer  coiiihiiialion.  one  clone 
library  was  generated,  and  we  sequenced  1.^  clones  per  library  The  primer 
comhinaiions  Icsied  were  (i)  .AI8Vh.fiXJR  (24).  (li)  AI8'Jl--fO(tR  (iti).  (in) 
Aiyt^F-iuWkilR  1 12). (h)  A18‘>F-(>f>2R (sciutnesied;  65nR). and  (v)  Alsar-b82R 
(seminested;  mhohlR),  .All  clones  sequenced  Irom  clone  libraries  gcneraied  hy 
use  of  rhe  AI8‘tF-Yi82R  primer  syMem  were  sequence  txjvseUssoly  rol.itod 
to  Ihc  annnonia-oxidizer  SnrtKSttspint  iviiltibmnn.  Clone  libraries  that  were  gen- 
er.Hcd  based  on  the  AfSitF-b^lR  and  AISOF-mbbbf  R  primer  systems  contained 
sonic  fkntiA  sequences.  However,  up  to  .Sthr  of  the  randomly  selected  clones 
contained  oonspecitk.'  inserts.  In  coiiirasi.  »|]  clones  sequenced  trmii  clone  li¬ 
braries  gcneraied  u.sing  ihc  two  seminested  PCR  approaches.  AI8UF  f>82R 
(seminested;  b.NiR)  and  A18‘>P-b.S2K  (seminested;  mbbblRi,  were  /wrwi. t  se¬ 
quence  types.  Therefore  the  semmesred  PC  R  .ippro.ich  was  siihsequenily  liv'd 
for  the  study  of  methaniurophie  diversity  and  K>r  generating  coninuinity  proliles 
hy  T-RFLP  .imilyNs. 

IM.  R  am|ilifici«tion.  .As  do scnlx'd  above,  the  .iniplihcnrioii  ot/wTu».  I  genes  w.is 
performed  via  a  seiiiineMcd  PCR  approach  using  the  5'  pinner  AIKV  and  the  7' 
primer  Ab82  (24).  The  icmpCTalure  pnilile  I  Table  I)  wus  identical  lo  the  previ 
oiisly  descriKrd  “touch-down  "  PCR  piolixol  (2.H).  Aliquots  ot  the  hrsi  round  ol 
PCR  (0  25  pi)  weie  used  .is  Ihc  template  in  Ifw  secxuid  loinid  of  PC'R  using  (tie 
y  pnmer  Al8d  .ind  the  two  X"  primers  mbbhIR  and  ti.5i)R  in  .i  multiplex  PCR 
setting  (i.c..  Isoth  reverse  primers  wore  present  in  the  sninc  rc.icfioii).  This 
appro«K'h  allowed  simultaneous  .impliticatkm  of  a  hroad  range  of  fminA  targets. 
The  reverse  primer  nihbbl  R  was  designed  for  Ihe  detection  of  type  t  and  lype  tl 
mcrhanoiioplis  (12).  while  ilie  reverse  pnmer  bStiR  was  designed  for  the  specific 
deiecnon  ol  putaiive  atmospheric  meih.ine  oxidizers  from  ihe  “R.A  14“  cfiide 
(tO)  Each  reaclion  iiiixlurc  contained  12.5  pi  of  M.tslerAllip  PCR  prctllLX  F 
(Epicentre  Technologies.  MaJivm.  Wl).  0.5  pM  of  (each)  pnmer  (QIAGEN. 
Alamcd.i.  CA),  t  7^  I J  of  Tm/  DNA  pslynvrasc  low  DNA  ( AinpliT.iq,  Applied 
Biosyslems.  FiMer  City.  CA).  and  112.^  pi  of  rempLiic  DNA.  .Amplilkaticm  was 
performed  in  a  total  volume  of  25  pi  in  0.2-ml  reaclion  tubes,  using  a  DNA 
Engine  ihcrm.il  cycler  (MJ  Rosc.irch.  S.in  Francisco.  CA).  The  PCR  ampliflc.i- 
tions  of  cnvironiiK’nt.il  DNA  resulted  in  amplicoiis  of  the  cXfKctcd  size  (a^sprox- 
imaiely  .5lK)  bp).  The  first  round  reaction  and  the  second  round  reaction  were 
each  pertormed  in  triplicate.  Aliquols  Irom  the  hrst  round  (three  independent 
reactions  in  tliicc  different  lulvs)  were  pooled  Ix'fore  gouig  into  the  scuntd 
round  (which  W'.i.s  itself  done  in  Inplicaie).  These  final  reactions  were  piHiled 
prior  lo  digestion.  AliquoLs  of  the  .implicons  (.5  pi)  were  checked  hy  electni- 
phoresis  on  .i  I'r  ng.irose  gel. 

Ciuning  uikI  sequencing.  PCR  products  were  cloned  using  a  TOPO  TA  clon¬ 
ing  kil  (Inviinigen  Corp..  San  Diego.  CA)  following  Ihe  prnlixol  of  Ihe  manu¬ 
facturer.  1  he  prepiiraiion  of  pliismid  UNA  of  randomly  sciccred  clones.  Pl'R 
aiiipUfkation  of  cloned  iitscTis.  and  noriradiouctive  sequctK'tng  were  carried  oul 
as  desetihed  previously  (28). 

Ph \ luKenel it'  »n:il\his.  The  idconiics  of  the  /w/o.4  gcoo  sequences  were  toi>- 
firmed  by  searching  the  inlcmalional  sequence  da!ab;*ses  using  the  BL/\ST 
program.s  (htlpi/'W'ww.nchi.nlm.nih.gtiv  BLA.ST;).  The  cnirrently  avaikible  d.ita- 
base  of  fknaA  acne  sequences  was  integralcd  w  ithin  the  ARB  program  p.ick.igc 
(.1.^).  and  DNA  sequences  were  .mnlyzed  .and  edited  iLSing  the  alignment  ukiIs 
iniplcmentcd  in  ARB  We  consirtieicxl  phylogenetic  trees  using  llie  tit.t\tiiiuiii 
hkelihiHHl  .ippro.ich  (with  Ihe  def.mlt  settings),  the  Filch  M.irgoliash  appni.ich 
(using  global  rc.irrangcment  amt  r.indoniizcd  inpul  oriler  with  three  jiimhlos). 
and  the  neighbor-joining  approach  (wiih  the  Fel.senslein  torrecnon)  in  ARB 
The  robustness  of  the  tree  topology  was  verified  thmugh  calculating  btxiLslrap 
values  lor  the  neighbor-joining  tree  and  through  comparison  of  the  loptilogv  ol 
the  trees  cunsiuicted  using  the  different  approaches 

Analysis  n(  molecular  evolution  of  the  novel  pmai  lineages.  The  mnleeuLir 
evolution  of  the  novel  /louj.-f  lineages  was  invesngafed  using  the  codemi  exociit- 
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TABLE  1.  Primer  description  and  thermal  profiles  for  PCR 


Primer  pair 

Sequence*  {5'-.3' ) 

No.  of  PCR 
round 

Thermal  prohle" 

Molecular 

analysis 

A6S2^ 

GGNGACTGGGACTTCTGG 
G/\y\SGC  NGAG  AAGAASGC 

1 

94«C.  4.ss;  (j2->2°C.  00s;  72^C.  ISOs  (.30  cycles)'* 

T-RFLP 

mUktlR.  A(>.5(IR‘‘ 

GGNGACTGGGACTTCTGG'’ 

C  C(  K  i  M(  iC'A AC  (  ;TC  Y7TACC . 
ACGTC  CTTACCG  A/\GC;T 

V 

94®C.  45s;  56°C.  60s;  72°C.  WIs  (22  cycles) 

M1.3F 

M13R 

GTA.A.AACGACGGCCAG 

C  AGO/ VVXCAC.CTATG  AC 

Q4T,  45s;  55 T.  60s;  72'’C.  60s;  (25  ctyclcs) 

Scqtioncing 

"  All  PCR  profiles  began  with  an  initial  denaluralion  at  94°C  for  3  min  and  ended  with  a  final  elongatntn  step  at  72'’C  for  III  min.  prior  to  holding  temperature  at 
4''C. 

*  Rctcrence  24:  AlH'f  is  the  forward  primer.  AbS2  the  reverse. 

'‘Touch-down  PC  R  was  used  from  62  to  52"C'.  After  each  c-ycle.  the  annealing  temperature  was  decrciised  hy  O  .^’C  until  it  reached  52"C'  (2S). 

Primer  laheled  with  5<arlHixynuoreseein. 

'  For  mb66IR.  sec  reference  12;  fur  A650R.  see  relerencv  lU. 


able  of  the  Phylogenetic  Analysis  by  M;L\imum  Likelihi«>d  (P.AML)  pr«)gram 
(5S).  The  input  nucleotide  flics  contained  a  4iS.1- nucleotide  portion  of  all  fmtoA 
sequences  s)i«wn  in  Fig.  I  (with  the  e.\eeption  of  sequence  “EfiFB-b“  |AJ57%6it| 
from  the  WB.^FH-A"  elade.  which  was  I(k>  short  to  include,  and  "LOP .A  12.6  * 
(AF3.^S(4.^1.  ••IY-6.4H”  |AY2.36.^1«|.  and  “RA  14"  [AFI4<S.S21|.  which  were 
added  TO  Fig.  1  during  final  revisions  of  the  nraniiscript:  in  addition.  "Vlp*^" 
[AY3725it|  was  removed  from  Fig.  I  during  bnal  revi.sions.  hut  was  present  in  the 
PAML  analysis).  To  reduce  the  level  of  sequence  divergence  to  within  recom¬ 
mended  levels  (Z.  Yang,  personal  communication),  the  PAML  analyses  were  run 
on  the  two  hah'cs  of  the  innoA  phylogenetic  tree  separately.  One  half  contained 
ilie  type  1  "WB5nt-A”  JR2.  and  JR3  chides  along  with  Ihc  two  :\hrf>so- 
t  tH  cus  imui  i  sequences.  The  other  half  contained  the  type  11  “RA  I4*‘  and  JRl 
dados  as  well  as  MfihyUn  m  uJift/ttla.  Due  to  the  divergence  of  the  two  unui.-i 
sequenevs  fn>m  the  /mioA  sequences  in  members  ul'  Ihe  class  (Janimaprote<)hac- 
unit,  the  type  t  side  of  ihe  tree  was  still  at  the  limits  of  acceptable  divergence  fi>r 
the  PAML  program,  and  so  .nnah'ses  for  that  side  of  the  tree  were  also  run 
without  the  iimitA  scquencc.s.  All  sequences  wore  in  trame  nnd  nligncd  (using 
MacCUidc  4.03  PPC;  Sinauer  AsscKiates.  Inc..  SnnUcrlanU.  MA).  and  the  lew 
ambiguous  sites  were  assigned  Ihe  nucleotide  of  their  nearest  phylogenetic 
ncighlwrs.  The  input  tree  fll«^s  were  created  using  PAUP  4. Oh  10  {4*)).  using 
analysis  by  disiance.  TKighhor  joining  with  Jukes-Canior  cxirrcxiion.  and  tics 
broken  randomly.  Their  topology  matched  that  of  the  tree  (Fig.  1)  presented  in 
this  paper. 

Bmnch  lengths  were  estimated  hy  Ihc  PAML  program  using  ihc  onc-ratio 
iikkIcI.  and  then  those  branch  Icngtlis  were  used  as  Ihc  tuiiial  values  lor  branch 
length  estimation  in  further  mi>dels  performed.  In  the  codeml  contnil  file,  the 
ni.ijoriTy  of  parameters  were  left  in  their  def.'inlt  specifications,  w’lth  the  folkiwing 
c.xccpiions;  runmodc  =  0.  seqtypc  =  1.  OKlonFroq  =  2.  MinJel  =  Oor  2.  and.  for 
the  multiratio  models.  hx_blcngth  1. 

The  one-ratio  mixiel  was  run  to  provide  an  estimation  of  a  single  nonsvnony- 
mous-to-synonymiHis  substitution  (“dN  dS")  ratio  tor  each  hall  of  the  tree.  A 
series  of  two-ratk>  uiudeU  were  Ihcu  nm.  to  allow  the  dN/dS  ratio  of  iIk  three 
novel  lineages  to  vary  in  turn.  Lastly,  the  dN-'dS  of  each  major  branch  and  clade 
(.'isdentsted  in  Fig.  2)  was  allowed  to  vary  simultaneously  under  the  freely  varying 
model,  generating  maximum  likelihood  estimates  for  all  dN/dS  values  across  the 
tree  (57). 

To  test  the  robustness  of  Che  parameter  estimales.  all  analyses  were  also  run  t>n 
various  subsets  ol  the  taxa.  with  little  variation  in  the  results;  this  is  consistent 
with  other  studies  iliat  have  shown  that  codeml  is  robust  to  sampling  (57.  5*f).  hi 
addition,  all  analyses  were  run  at  least  twice  to  ensure  that  parameter  estimates 
were  likely  global  rather  than  kKal  optima. 

The  likelihood  ratio  test  was  used  to  as-sess  the  goodness  ot  tit  of  the  two- ratio 
models  lo  the  data  aixl  to  compare  it  with  that  of  the  onc-ratio  model.  Tliis 
allowed  u.s  to  test  whether  the  dN/US  ratios  on  Ihe  branches  leading  to  the  three 
novel  cladcs  were  significantly  dilferent  trom  the  background  dN/dS  ratio  in  the 
rciiiauidcr  of  the  tree  (57). 

T-RFLI*  anuiysis.  The  creaikm  of  terminal  restriction  fragments  (T-RFs)  from 
pmoA  genes  was  earried  out  as  previously  described  (28).  After  purification  with 
QIAquick  spin  columns  (QIAOEN.  Alameda.  CA).  approximately  IDO  ng  of  rhe 
amplicons  was  digested  separately  with  20  U  of  the  restrietkm  endonuclease 
Mspl  (New  England  BioLabs,  Beverly.  MA).  The  digestions  were  carried  out  in 
a  total  volume  of  HI  r.1  for  3  h  at  3T’C  according  to  the  instructions  ol  the 


manufacturer.  Enzyme  inactivation  was  carried  out  by  incuhation  al  65T'  for  20 
min  The  subsequent  T-RFLP  analysis  was  performed  at  the  Genomics  Tech¬ 
nology  Support  Facility  (httpr  /genomics. msu.edu.*;  Michigan  State  Univetsity. 
East  Lansing.  Michigan).  Briefly,  the  T-RFs  were  separated  by  capillary  electro¬ 
phoresis  on  an  ABI  Ptism  3700  DNA  analyzer.  The  DNA  hands  were  automat¬ 
ically  identified  nnd  sized  using  GcneScan  sofrw'arc  (Applied  Biosystcnis.  Foster 
City.  CA)  and  comparison  to  internal  lane  standards.  The  relative  ahuuUanccsof 
inUiviUuiil  T-RFs  in  a  given /)/iio.4  PCR  product  were  calculated  based  on  the 
peak  height  of  the  individual  T-RFs  in  relation  to  the  total  peak  height  of  all 
T-RFs  delected  in  the  respcctKc  T-RFLP  community  Kngerprint  pattern.  Tltc 
peak  heiglits  were  atitoinatkally  qnantilicd  by  the  GcneScan  software  To  verify 
the  assignments  of  T  RFs  lo  our  detected  l  gene  types,  we  also  tested 
individual  clones  hy  T-RFLP  an.! lysis. 

The  T-RFLP  results  were  highly  reproducible.  The  coeflieient  of  variation  of 
the  relative  signal  intensity  of  the  T-RFs  between  different  DNA  i.solations  from 
the  same  soil  sample  ranged  from  .3  to  1(1.1'^.  The  coefficient  ot  variation  of  the 
relative  signal  intensity  of  the  T-RFs  between  different  PCRs  from  a  single  DNA 
sample  ranged  from  1  to  0.5'1(.  Tliose  varialiuus  are  in  the  same  range  as  those 
previously  repotted  (28).  The  variations  between  different  digests  from  Ihc  same 
PCR  product  and  between  different  clcctroplioretic  runs  from  the  s.nme  digest 
were  negligible.  This  is  conswieni  with  prevkius  syslemalic  evaluations  of  Ihe 
T-RFLP  method  (.38). 

.Statistical  analysis.  The  relative  abundance  data  were  analyzed  with  a  split- 
plot  analysis  ot  vnrinnee  performed  using  the  MIXED  pnKcdurc  in  SAS  (SAS 
Institute,  liic  .  Cary.  NC).  Means  wete  estimated  as  least-square  incaiis.  and  the 
degrees  of  freedom  were  estimated  using  the  Salierthwaite  approximation.  The 
dat.T  were  arcsine- transformed  before  analysis. 

Nucleiilide  sequence  accession  numbers.  The  partial  prmf.A  gene  sequences 
determined  in  thb  study  have  been  deposited  in  the  EMBL.  GcnBank.  and 
DDBJ  nucleotide  sequence  daiabicses  under  the  acxessinn  numbers  AY6.54669 
through  AY654732. 

RESULTS 

Characterization  of  pmoA  genes.  rii>nc  libraries  were  ci>n- 
slructcd  using  pmoA  PCR  prixlucts  from  three  experimental 
plots:  two  with  elevated  CO,,  temperature,  precipitation,  and 
nitrogen  (plots  ID  5  and  ID  6())  and  one  with  ambient  levels  of 
CO2,  temperature,  precipitation,  and  nitrogen  (plot  ID  107). 
In  tiftal,  64  clones  were  analyzed  ( 1 1  clones  for  ID  5,  21  clones 
for  ID  60.  and  32  clones  for  ID  107).  Figure  i  shows  the 
phylogenetic  athliation  of  all  clone  sequences  analyzed  in  this 
study. 

Five  sequences  formed  a  distinct  cladc  (JRl)  that  was  re¬ 
lated  to  the  ‘'RA  14“  cladc,  environmental  sequence  types  that 
have  been  hypothesized  to  represent  uncultured  “high-aflinily" 
methanotrophs  capable  of  oxidizing  methane  at  atmospheric 
concentrations  (21,  25).  The  similarity  in  DNA  sequence  be¬ 
tween  JRl  and  the  “RA  14“  cladc  was  approximately  S0%. 
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LOPA  12  6(Af36«043j 
-  nr-«4^(Ar236£ir) 

lUifV^tCiiAM  A:<(«pA*to<AJ278727) 

AV»S4ri4 

J*54»^  (AVI547101 
JNS4S04jkY«>l7l7) 
JR904M<4  tAYSWiZI 

JK««4M'3  (AYSjM7l  1} 

-  -  V14»527j 

_ (AY500134J 

U  14  {AF1 44821) 

'A  (Af  i0072r| 

_  Vtoll  iAY37i380) 

i«e(fyownu>  fricno^CKVium  (AJ4  j  1 3«i]> 
Uimviocyslft  iAj  4840001 

JRI^-MI-11  (AY«8471«I 
JM107441-17  (AY«947i4) 
JA107.«1.l2fAYi547«| 
Jltl07.««1-4  (AY8M71I) 
JIM8748W2  (AY6>473^ 
JA107<50-11  tAY*jA72y 
JK1674M-1t  ^Y8M72T) 
«^>jAx.r>tepA>vuj  'MPAa^'IOAC) 

J  R 1  ()74M 'SOtAYK  M  721} 

_  JfM07<SO’ir}AYeS47ioj 
JR1 8  7450.1  •  (AY654723) 

'3  :AJ294461  } 

MAmvAxyxtfS  tt>.  SC2  (AJ431367) 
Top  A  13  1  (AF3W0S«1 
JR  187461-11  (AYfS4470) 


,  (AYfS4470) 

1.  .mi87489*14  (AY854«74i 
TI_  JIM074».li  (AY8548r3) 

\  JR  107  480-23  iAY«S4«72) 

\  JR  107450-13  TaY0S4«7i) 

JR  1 07430-17  fAY86487S 

_  J1^4S0-1  (AY  888008} 

_  JRf040l-3(AY8S08M] 

_  JR984M-0  (AY8V8M5} 
r-  — ‘.l4(XY854tt2} 

4  (XY6S487I] 

_ 1>10(AVS»«Mm 

JR  1 8  7-08 1  - 1 5  ( A  Y  83468 j) 

JR1 07481 -21  lAY8S4e^ 

JR187481-35  UYt$467T{ 

JR604S(M  ;AV8S4000| 

JR«O4$0-7  &YiS4681) 

JRM304  CAYfUrrO) 

JR  1874604  (AY884«iS} 

JR1074504  lAY8S480lj 

- AY8546ill 

.Y8340an 
kj;AY»MM8) 

lY8  34700} 

}Y0S40M 
IAY8S4702} 

rtJR80481>7  Ha  Y83470n 
JR80-M14  (AY8S4784} 

.  JR 1874«1.»  (AY  834706} 

JR  107 -MI-23  (XY0S47M 
_  JR1 07401  •11HaY8347071 
<  ^107-M1-30(AY6S470«| 

H  ,  JR107-M1-24(AYOS87aO) 

H  JR1I7-M1-37  UYM4716i 

NHrv90C0CGU9  <x**nos  (AJ2967001 
3  (47153344) 
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JR  5 
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Type  II  methanotrophs 
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«h#ar>Qph44c  m*1h#f»ovi<)lT4c  Rr4«  (U«#302) 
*- ffijApcAWlim  'U80303) 

--  -  •  xc<ic  c^muiA/s (Ar5^3666) 
lAfoOuxn  cntnAOM  (AJSdi  B3r 
so  LW14  iaY0C72( 
MePiwofncrooi^  a/Oum  (U3 10!. 
AMhyAjmonac  msPianica  <U3 


.  M%ttrf*OtnKr)tMn  (AFjO?! 

U»Ph3aO*C«*r  «p  BR5  1  {A7O109O2J 
,  Atalh)4nrn)cfD8t4i)*ip«M0icijm  (U316S- 


_|-  JR08482-7  1AY8S4712) 
JR80-M24  (AY8i472«} 
100  i-P- JR«04«-18  [aY^731) 
I— fl— JROO-M2-1  (Ay  8^20) 

4 _ L  - 


JROO-M2-1  ,  . 

JR80482-7  tAVl54730} 
t4Jtr9*Of>n  «mjth*OnTU9  /VB08J3) 

AMnMOnOMS  AurOMM  Af  337 107) 
-  RA21  (AF 1485221 


JR  6 


Type  I  methanotrophs 
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FK'i  I  Phyloncnctic  rchitionships  umong /W7i7vl  ncnc  types  identified  in  the  Jiispcr  Ridge  GUibiil  C  hange  Experiment  and  pt}\oA  and  amoA 
gene  types  available  in  public  databases  (J,  3,  (>-0.  17.  21,  2.*',  2^30.  .32.  .3.*i,  37,  45,  53).  Sequences  obtained  in  this  study  are  shoivn  in  boldtacc 
ty’pc  with  the  prefix  'JR"  and  arc  designated  cladcs  JRl  to  JR6.  The  enviionmental/>/ncv4  sequences  used  for  reference  were  rctriex'cd  from  vai  ious 
habitats,  ns  follows:  foiost  soils  (AFi48527,  AF148521  |25|,  AF14S522  115),  AF2(X)727  121).  AY5lK}|.M.  AY37136()  129)),  rice  liclds  (AJ2W%I 
j2S|).  peat  s<iil  (Al  35KiU3.  Al\35«()46  |35|,  A  Y2.365IS  [9|),  and  upland  grasslaml  Mills  AJ57966M,  AJ5796ftK,  AJS7W»67  (321)  The  scale 

bar  coi  responds  to  0. 1  substitutions  per  nucleotide.  The  tree  was  calculated  using  475  nucleotide  positions  and  the  neighbor  joining  approach  (with 
the  Fclsenstein  correction),  via  the  ARB  program  package  (.33),  Flic  tree  topology  was  confirmed  using  the  maximum  likcliluMHj  appioaeh 
Bootstrap  values  were  eiiltulatcd  using  1,(KX)  replications  AOB,  ammonia-oxidizing  bacteria. 
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FIG.  2.  The  dN/dS  values  of  the  major  lineages  oipmoA  as  estimated  using  the  codeml  executable  of  the  PAML  program.  The  numbers  ai  each 
branch  are  ihc  dN  dS  ratios  estimated  by  the  program  under  the  freely  varying  mtxlel.  which  allow'ed  the  dN.dS  tif  each  major  branch  and  clade 
(as  denoted  with  dashed  circles)  to  vary  simultaneously.  (The  asterisk  at  the  branch  connecting  the  type  1  and  "WB5FH-A**  clades  to  the  other 
Alphaproteobocteria  clades  indicates  a  noncompulable  dN/dS  ratio,  where  dN  =  0.03  and  d$  =  0).  Nave!  clades  are  shown  in  boldface  type,  as  are 
the  dN.’dS  latios  of  the  blanches  leading  to  these  clades.  The  dN ‘dS  ratios  within  each  clade  are  not  shown.  Analyses  were  run  on  each  half  of  the 
ptTUK‘i  tree  independently;  the  split  between  the  Alphaproteohacteria  and  Gammuprotet^ictena  sections  is  indicated  by  a  dashed  line.  MOB. 
mclhanc-«nidi/ing  bacteria. 


Two  sequences  (“  JROO-O.'SO-.V  and  “JR6l)-6.S0-8‘‘)  grouped 
lightly  with  the  “RA  14"  clade. 

Two  further  clades  (JR2  with  22  sequences  and  JR3  with  !4 
sequences)  were  moderately  related  to  each  other  but  showed 
no  close  relationship  to  any  cultivated  methanotroph  species. 
They  grouped  distatitly  Ut  type  I  tnethanotrophs,  with  a  DNA 
sequence  .similarity  of  approximately  72%.  While  representing 
distinct  lineages,  JR2  and  JR3  branched  together  with  the 
**WB.SFH-A"  clade.  the  second  novel  pmoA  lineage  suggested 
to  represent  atmospheric  methane  oxidizers  (32).  Although 
supported  by  a  bootstrap  value  below  5i)%,  the  common 
branching  point  of  JR2,  JR3,  and  the  “WB5FH-A“’  clade  was 
supported  by  a  tree  calculated  using  the  maximum-likelihood 
approach.  In  contrast,  a  neighbor -joining  tree  calculated  from 
deduced  amino  acid  sequences  favored  a  common  branching 
point  of  the  “WB5FH'A‘’  clade  with  type  I  methanoirophs; 
however,  the  IxnXstrap  value  was  again  below  50Ci.  Phyloge¬ 
netic  analysis  consistently  suggests  a  common  evolutionary  or¬ 
igin  for  the  sequence  clu.slcrs  JR2  and  JR3  and  the  amoA  gene 
of  Nitms(X’<HCu<i  nceani  (an  ammonia-oxidizing  baclerinm  that 
is  capable  of  using  methane  as  a  carbon  source  and  whose 
prevalenee  is  thought  to  be  restricted  to  aquatic  systems). 

Eleven  pmoA  sequence  types  (clade  JR4)  were  closely  re¬ 


lated  to  Methylocystis  pamts,  a  relatively  well-eharacierizcd 
member  of  the  type  II  methanoirophs.  We  also  found  six  sc 
qucnces  (clade  JR.'^)  that  grouped  together  with  a  novel  pmoA 
lineage  previously  deseril>ed  as  a  diverged  second  pmoA  gene 
copy  present  in  various  strains  of  type  II  methanoirophs  (.Sll). 
JR5  and  the  novel  pmoA  copy  of  one  of  the  representative 
species  {Methylocystis  sp.  strain  SC2)  had  a  DNA  sequence 
similarity  of  85%. 

In  summary,  we  identified  pmoA  gene  types  belonging  to 
live  ditlerenl  lineages  within  the  phylogenetic  radiation  of 
the  pmoA/ot?toA  family.  Three  of  these  clades  (clades  JRl, 
JR2,  and  JR3)  have  DNA  sequence  similarities  of  80%  or  less 
with  previously  described  pmoA  variants. 

Nonsynonymous/synonymous  substitutinn  rates.  The  ratio 
of  nonsynonymous  to  synonymous  nucleotide  substitution 
rates  (dN/dS)  was  determined  for  each  novel  clade.  The  over¬ 
all  dN'dS  ratio  (as  calculated  with  the  oiie-ratio  model  of 
codeml)  was  D.l !  for  the  type  I  side  of  the  pmoA  tree  and  0.10 
for  the  type  II  side  (data  not  shown).  The  dN/dS  raik^s  (as 
calculated  with  the  freely  varying  mmlel  of  cmlemi)  along  the 
branches  leading  ti)  the  three  novel  lineages  (JRl.  JR2,  and 
JR3)  were  0.10,  0.17.  and  0.20,  respectively  (Fig.  2).  The  like¬ 
lihood  ratio  test  showed  that  these  dN/dS  ratios  were  not 
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significiinily  dillcrcnt  (I*  <  0.1)5)  I'rom  ihc  buckground  dN'dS 
nitiON  in  their  respective  sides  ol  the  tree  {diitii  not  shoNvn). 

I'hcsc  dN;dS  riitios  could  result  Irom  ^vo  possible  scen;irios. 
Cliides  JRl,  JR2.  iind  JR5  could  hiive  diverged  recently,  with 
iiisutficjent  time  loi  their  dN'dS  ratios  to  reflect  any  change  in 
lunetional  slate,  or  the  three  chides  could  have  diverged  ear¬ 
lier.  with  the  low  dN/dS  ratios  reflecting  continued  purifying 
selection.  Using  codeml.  we  estimated  the  relative  divergence 
of  the  PmoA  clades  by  estimating  the  likely  numbers  of  syn- 
onymtms  changes  along  each  of  the  three  lineages,  based  on 
the  assumption  that  syiumymous  cluingcs  are  not  acted  upon 
by  selection  and  accumulate  steadily  over  time  (55)  (diver 
gence  limes  can  also  be  rinighly  estimated  by  inspeciitin  of 
brunch  lenglhs  in  Fig.  I;  however,  these  lengths  reflect  bi>th 
synonymous  and  non  synonymous  changes).  1  here  were  ap- 
pro.xiniately  65  synonymous  changes  along  the  branch  to  JRl. 
5cS  to  JR2, 54  to  JR.^.  and  H)9on  the  branch  leading  to  JR2  and 
3.  This  is  not  substantially  less  than,  for  example,  the  47  syn- 
imymoiis  changes  leading  to  Mi'fliylocupsu  ucidiphila,  the  1 1^ 
leading  to  the  *’WB5FI1-A*’  chide,  and  the  55  leading  ii^  Type 
I  methanotrophs  (data  not  shown).  Thus,  the  cladcs  JRl.  JR2. 
and  JR.^  di^  not  appear  to  have  diverged  recently  compared  to 
other  known  p*noA  clades.  and  so  their  estimated  dN/cIS  ratios 
suggest  that  they  are  undergoing  purifying  selection,  encoding 
functionally  active  proteins. 

Conservation  of  amino  aeW  residues.  1  he  pMMO  and 
AMO  genes  arc  cviilulionarily  related  (24).  and  at  the  amino 
acid  level  they  share  a  number  of  highly  conserved  residues 
(25.  42.  52).  Based  on  alignments  of  the  predicted  peptide 
sei|iiences  of  the  «  subunits  of  1 12  particulate  methane  mono¬ 
oxygenases  (PmoAs)  and  .^4^  ammonia  mon(K).xygcnases 
(AmoAs).  Tukhvatullin  ct  al.  (52)  identified  residues  common 
to  both  proteins.  Ricke  et  al.  (42)  e.xtended  this  analysis  to 
include  the  second  PmoA  gene  copy,  PmoyV2.  present  in  many 
Type  11  methanotrophs  (5l)).  The  inferred  translation  of  the 
region  amplified  by  the  primers  used  in  our  study  spans  16  of 
these  highly  conserved  residues  (Table  2).  All  memhers  of 
novel  clades  JR2  and  JR.3  each  had  all  16  of  these  conserved 
residues.  All  members  of  JRl.  JR4.  and  JR5  had  15  of  the  16 
conserved  residues,  with  all  but  one  member  in  each  group  also 
having  the  16th  residue  Among  the  residues  common  to  both 
PmoA  and  AmoA.  Tukhvatullin  el  al.  (52)  proposed  a  subset 
of  seven  lliat  could  potentially  lx;  the  metal  ligands  of  the 
active  site.  The  translation  of  our  amplified region  spans 
three  of  these  (residues  ElOO.  Y157.  and  HI69).  which  are 
conserved  in  all  Jasper  Ridge  sequences,  A  further  set  of  lour 
residues  were  identihed  as  potential  non-aclive-sile  metal  li¬ 
gands.  which  could  additionally  stabilize  the  peptide  structure 
(52);  our  amplilied  region  spans  two  of  these  (residues  DIH2 
and  Y106).  which  are  also  conserved  in  all  Jasper  Ridge  se¬ 
quences. 

In  addition.  Holmes  et  al.  identihed  2 1  residues  that  could 
distinguish  PmoA  from  most  AmoA  sequences  (25).  Our 
pnufA  ainplieoiis  spanned  16  of  the  putative  PmoA/AnioA 
diagnostic  residues.  All  of  the  cladcs  we  detected  (with  the 
exception  of  JR6)  shared  a  high  percentage  of  amino  acid 
residues  typical  of  PmoA  (‘Fable  2).  JR4  had  all  16  of  the 
PmoA-speeilie  residues,  while  JR5  and  JRl  had  13  and  II, 
respectively,  of  the  PmoA  residues,  and  in  all  ca.ses  the  mis¬ 
matches  were  amino  acids  belonging  to  the  same  amino  acid 


similarity  gr<mp  (22)  as  the  conserved  Pmi^A  residue  (‘Fable  2). 
Both  JR2  and  JR3  shared  14  of  the  16  PmoA  residues.  The  two 
mismatches  in  JR2  anti  one  in  JR3  were  in  the  same  amino 
acid  similarity  groups  as  the  conserved  residues,  while  the 
other  mismatch  in  JR.'  was  a  perfect  match  in  half  i>t  the 
sequences  in  this  cladc. 

T-RFLP  community  profiles.  Figure  3  shows  a  represent;! 
live  community  T-RFLP-proHle  and  the  assignment  o\  the 
T-RFs  tt!  the  sequence  clusters  delected  in  our  study.  All 
clones  produced  the  T‘-RFs  that  were  predicted  based  t»n  the 
sequence  information  (data  luU  shown).  All  pnufA  clades  de¬ 
termined  by  comparative  sequence  analysis  could  bo  consis 
tenily  recovered  by  T-RFLP  c(!mmunity  analysis.  JR2.  JR3. 
and  JR5  exhibited  specitic  F-KFs  (20»S  bp.  373  bp,  ami  34^  bp. 
respectively),  eonlirmed  by  in  silico  analysis  id  the  publicly 
available  pmnA  gene  sequences  (comidned  with  the  sequences 
generated  in  this  study).  JR4  produced  a  T-RF  of  245  bp  as 
anticipated  (this  is  the  specific  T-RF  for  the  type  II  meihano- 
trophs)  (2S).  However,  no  specific  F  RF  could  be  generated 
for  chide  JRl  by  use  of  MspI  (i  e..  the  Hl)-bp  T-RF  generated 
by  JRl  can  also  be  pnxluccd  by  digestion  o[ ptnoA  sequences 
from  Maliyl(K(KLUs  cap.sufunis  anil  related  species,  as  well  as 
3/  (icuiipliilii).  A  T-RF  of  .34  bp  was  indicative  for  sequences 
belonging  to  the  “RA  14“  cladc. 

Although  not  conhrnied  by  cloned  sequences,  our  T-RFLP 
community  proliles  indicated  the  presence  ol  various  memhers 
of  type  I  melluniidroplis  (e  g..  T-RFs  of  440  bp.  .505  bp,  and 
51 1  hp.  with  the  latter  bvo  representing  undigested  pmoA  se 
qiienee  types  without  the  MspI  recognition  site)  (2S).  although 
in  low  abiinilaiicc  (generally  less  than  4'"  /  of  the  total).  U  e  can 
think  of  at  least  two  possible  explanations  for  the  absence  ol 
pmoA  sequences  related  to  type  I  methanotrophs  in  onr  clone 
libraries,  namely:  (i)  low  relative  abundance  of  the  type  I  me¬ 
thanotrophs  combined  with  none.xhaustive  clone  sampling,  and 
(ii)  cloning  biases  against  type  1  sequence  types.  Recently  re¬ 
ported  discrepancies  between  the  community  composition  of 
pmoA  clone  lihrarics  and  pmo^-hased  F-RFLP  analysis  (2S. 
40)  suggests  that  such  biases  can  Ik*  present.  Given  lliis  pos.si- 
hiliry.  we  did  not  attempt  to  determine  the  response  of  type  I 
methaiunrophs  to  simulated  global  change  in  our  study. 

The  response  of  methanotrophs  to  simulated  global  change. 
We  generated  T-RFLP  ei^mmunity  proliles  of  methanotrophs 
from  all  replicate  treatments  of  our  multifaetorial  climate 
change  experiment  (8  replicates  of  16  treatments,  for  a  total  of 
128  soil  samples).  Simulated  global  change  did  not  signilicantly 
niter  the  number  of  T-RFs  present  (the  phylogenetic  richness 
ol  the  niethanolroph  community)  or  the  magnitude  i!l  Shan¬ 
non.  Simpsmi,  or  Berger-Parker  diversify  indices  (34)  calcu¬ 
lated  from  the  T-RFLP  data.  However,  the  simulated  global 
changes  did  alter  community  eompositiiin.  The  relative  abun¬ 
dance  of  type  II  methanotrophs  (cladc  JR4)  sign i flea ntly  de¬ 
creased  under  elevated  preeipilation  (F,  >4  =  7.89;  P  =  0,0068) 
(Fig.  4)  and  elevated  temperature  >4  =  4.12;  F  =  0.0469) 
(Fig.  4).  However,  these  effects  were  not  additive:  i.e.,  there 
was  a  significant  antagonistic  interaction  between  precipitation 
and  temperature  (F,  ,4  =  8.31;  /*  »  ().tRI55)  (Fig.  4)  such  that 
the  effeci  of  both  treatments  together  was  less  than  that  ex¬ 
pected  from  their  individual  eflecis.  In  contrast,  the  relative 
ahundanee  of  the  novel  niethaiiotropli  cladc  JR2  responded  to 
simulated  global  change  very  differently  (Fig.  5).  Elevated  pro 
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TABLE  2.  Presence  of  consei'vcd  and  diagnoslic  AA  residues  Id  PmoA  and  AmoA  across  taxa" 


A  A  A  A  A 


•’  The  amino  acids  (AAJ  arc  numbered  according  to  the  published  sequence  for  A/,  tapsiihunx  PmoA  (47).  Uppercase  letters  are  residues  conserved  in  oJ  the 

reference  data  set:  lowercase  letters  are  res'idues"  conserved  in  of  the  reference  set  Letters  in  p;irentheses  indicate  eoaservation  within  AA  similarity  groups  (A 

PaGST:  D.  QNEDBZ;  H.  HKR;  1.  LIV'M.  F,  FYW).  Ties  are  indicated  by  both  letters  with  a  slash  l4tween  them.  Residues  lOtl.  157.  168.  182,  and  1%  (with  '  t^h>w) 
are  putative  metal- binding  residues  as  described  by  Tukhvaiullin  el  iiL  (52).  Residue  columns  containing  gr.ay  backgrounds  are  Ami»APmoA  diagmsslic  sites  described 
by  Holmes  ai  al  (25).  Residues  on  n  black  background  arc  generally  agreed-upon  AmoA  'Pmo.A  conserved  sites  (Z5.  4i  52).  Residues  in  bold  type  and  framed  arc 
AmoA  di>ign«viiic  siics  for  ammonia  oxidizers  from  ihe  ( itimnMimHarhtu  leria. 

*  Type  tl  PmoA  (n  6).  including  M.  /xmiiT.  .i/ct/iv/iAvr/b  trehinouk's.  Mcfhviotyuiy  (richt/yptfrium,  uncultured  bacterium  AF.^58046  (35).  M<r(h\hH\ytis  strain  SC2 
(17).  and  uncultured  bacieiium  M84  P3  (AJ2Wd61)  (28). 

*■  clade  /\inoA  (n  8),  including  ;V.  cKcaiii  (U%6ll)  (37),  N.  (kcmiii  strain  AFC'27  (/\F50y(X)l)  (53).  strain  SW  (AF5(.WDD3)  |53).  strain  AFC 

(AF5089<W)  (.53).  strain  AFC12  (AF.5089%)  (5.3).  strain  AFC36  (AF5089<)5)  (5.3).  Witro.uumrus  sp.  strain  Cl  13  (AF153344)  (2).  and  uncultured  haeterium  BAC6 
(AFn70<)87)  (45). 


cipilation  and  Icmpcraliirc  incrca.scd  the  relative  ahiindanee  i)f 
this  clade,  and  there  was  a  significant  antagonistic  interaction 
between  elevated  preeipHalion  and  tempeniiure  iF^  24  =  13.4S; 
r  =  0.1X112)  (Fig.  5)  as  well. 

DISCUSSION 

PmuA-based  approach  fur  methanotruph  community  anal¬ 
ysis.  The  aim  of  the  prcseiit  study  was  to  explore  the  methan- 
o(rophic  diversity  of  a  Californian  upland  grassland  and  to 
assess  whether  a  shift  in  the  mclhanotrophic  community  struc¬ 
ture  in  response  to  simulated  global  change  was  dctec  table.  We 
assessed  methanotroph  diversity  in  this  study  using  a  cultiva¬ 
tion-independent  approach,  with  pmoA  as  a  molecular  marker. 
To  date,  most  studies  involving  pw«/l-based  analysis  of  me- 
thanotrophic  populations  have  u.sed  the  primer  system  A1S9F- 
682R.  These  primers  also  amplify  amoA,  which  encodes  the 
homologous  subunit  of  the  ammonia  nioiuxixygeiiasc  in  nitri¬ 
fying  bacteria.  Reverse  primers  that  discriminate  against  the 
amoA  (c.g.,  mb66l)  and  highly  specific  primers  with  intended 
target  specificity  for  the  “RA  14*’  clade  (e.g.,  650R)  have  been 
applied  as  alternative  mellKxIs  for  studying  methanotroph  di¬ 
versity.  Ekmrne  ct  al.  (10)  tested  these  three  primer  sets  in 
various  sv)ils  and  found  that  one  primer  combination  alone  was 
not  sutticiciU  to  explore  mcthanotrophic  diversity. 


We  tested  five  different  primer  combinations  (three  single- 
round  PCR  assays  and  two  nested  PCR  assays)  in  order  to 
determine  their  potential  to  delect  a  broad  range  of  niclhan- 
otrophs  in  our  grassland  soil.  When  primer  set  A1H9F-682R 
was  used,  clone  libraries  created  from  the  single-round  PCR 
amp  I  icons  showed  a  high  representation  of  umoA  inserts. 
When  primer  set  AKS9F-nib661R  or  AIS9F-6.>(1R  was  used, 
the  clone  libraries  contained  a  large  number  of  nonspecitk 
inserts.  Nested  PCR.  however,  using  the  A189F-682R  primer 
set  in  the  first  round  and  either  reverse  primer  mb661  or 
reverse  primer  65()R  in  the  second  round,  generated  consis¬ 
tently  high  yields  of  pmoA  amplicons,  even  in  some  soils  lor 
which  single-round  PCRs  produced  little  or  no  pmoA  amplili- 
cation.  In  fact,  all  analyzed  clones  derived  from  nested  PCRs 
were  ''pmoA  ptssitivc.”  The  reverse  primers  we  used  detected 
different  components  of  the  methanotroph  community.  Primer 
650R  detected  the  chides  JRi  and  sequences  from  the  “RA  14” 
clade.  clade  JR3  could  be  detected  only  with  reverse  primer 
mb66i,  and  chides  JR2,  JR4,  and  JR5  were  detectable  with 
both  reverse  primers.  Therefore,  we  used  both  reverse  primers 
together  in  a  multiplex  (i.e.,  in  the  same  reaction),  nested  PCR 
approach  ftir  the  T-RFLP  community  analysis.  This  enabled 
us  to  simultaneously  recover  a  broad  range  of  distinct  pmoA 
cladcs  in  single  electrophoretic  profiles  for  each  sample. 
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FIG  3.  Representative  T-RFLP  protilc  ot  the  mctlianotropli  conimiinity  and  the  assignment  (mro'vs)  of  the  T-RFs  to  known  niethanotiophs- 
sublincagcs  ,ind  lopttutA  gene  types  determined  in  this  study.  The  phylogenelie  lice  w as  graphieally  miHlilied  from  Fig.  I.  Arrows  with  dashed  lines 
indieate  the  existence  of  multiple  sequence  types  that  potentially  can  produce  the  respective  T-KFs  aecoiding  to  the  sequence  iidoinialion  of  the 
ptnoA  database  (i.e.,  1  RFs  of  SO  bp.  44<l  bp.  .sl)3  bp.  anrl  .si  I  bp).  AOB.  ammonia-rnidi/ing  bacteria. 


MethanntrophU'  diversify.  We  disenvered  a  remai  kahly  high 
diversity  oi  pmoA  gene  types  in  our  study  (Fig.  1).  including 
those  closely  rclate'd  to  the  pnioA  of  known  nicnibcrs  of  the 
class  Alphuprofeohiuremi  as  well  as  gene  types  distinct  lri>m 
known  species  forming  hitherto  undescribed  pmoA  lineages. 
Within  type  II  methanotrophs.  we  found  sequences  closely 
related  to  M  punus  (clade  JR4).  as  well  as  the  recently  char- 
aeteri/cd  type  II  pmoA  gene  copy  (.SI)  of  MerltyhKystis  sp. 
(clade  JR5).  Interestingly,  the  relative  abundance  of  the  T-RFs 


I  l(i.  4.  l.ffeel  of  lenipciatuie  and  pi  ccipitalion  oii/;rmj>l  clade  .1R4 
(type  II  melbanoiropbs)  in  the  JKCJCt.  T  he  mean  relative  abundance 
of  JR4  is  depicted  for  all  samples,  grouped  by  temperature  and  ptc- 
cipitation  treatments.  For  example,  the  first  bar  depicts  the  mean  rel¬ 
ative  abundanec  of  JR4  from  all  experimental  plots  under  ambient 
temperature  and  precipitation,  including  those  under  both  ambient 
and  elevated  COj  and  ambient  and  elevated  nitrogen  treatments  («  = 
.32).  I  noi  hais  are  93^  ^  confide  nee  limits.  MGIk  inelbane-ivxidiiing 
bacteria. 


was  eonsistently  higher  lot  JR4  than  lor  Jli.^  in  (riir  I  -RFI  F 
profiles  (data  uol  shown),  which  agrees  with  the  tiiulings  of 
Tchawa  Ytmga  et  al.  (.SO)  that  not  all  type  II  methanotrophs 
possess  this  additional  gene  copy.  We  also  discovered  the  clade 
JR  1.  which  forms  a  distinct  subgroup  of  the  ‘RA  14*‘  clade.  the 
clade  that  has  been  putatively  identified  as  atmospheric  meth 
ane  consumers  (21.  2.^).  This  finding  considerably  expands  the 
known  depth  of  the  “RA  14”  clade  and  demonstrates  that 
metlianotropits  |>osseNsing  this  gene  type  arc  iK>t  restricted  to 
forest  soils.  W'e  did  not  detect  the  other  putative  atmo 
spheric  methane  consumers,  the  “W'B.SFH-A”  clade  (.32). 


I _ *1^  *niT  Ciftvitwl 


FKi.  r.tteet  of  temperature  and  precipitin  ion  on  novel  ptruy-X 
chde  JR2  in  the  JRGCE.  The  mean  relative  ahundance  ot  JR2  is 
depicted  for  all  samples,  grouped  hy  temperature  and  precipitation 
trciitnicnts.  Foi  example,  the  fiist  har  depicts  the  mean  relative  abun¬ 
dance  of  JR2  from  all  experimental  plots  under  ambient  tcmperainre 
and  precipitation,  including  those  under  both  ambient  and  elevated 
CO.  and  ambient  and  elevated  nitrogen  ticatnients  (a  =  .32).  I  iioi 
bars  are  confidence  limits.  MOB.  methane-oxidi/ing  bacteria. 
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altht^ugh  wc  did  discover  two  novel  dados  {JR2  and  JR3) 
which  are  distantly  related  to  the  “WB5FH-A''  clade. 

riiere  arc  several  lines  of  evidence  that  suggest  that  the 
three  novel  pmoA  dades  we  discovered  (JRl.  JR2.  and  JR3) 
cncixlc  functional  nionooxygenases,  with  a  primary  substrate 
of  methane  rather  than  ammonia.  All  three  of  the  novel  dados 
had  dN  dS  ratios  well  below  1  f  Fig.  2).  evidence  for  purifying 
selection  .‘'6).  If  these  genes  were  nonfunctional  copies,  a 
lack  of  selection  would  result  in  nonsynonymous  changes  oc¬ 
curring  at  the  same  rate  as  synonymous  changes,  pushing  the 
overall  dN'dS  ratio  towards  1;  how  closely  it  approached  1 
would  depend  on  the  divergence  time  of  these  cladcs.  The 
dN/dS  of  the  branches  leading  to  all  three  novel  chides,  how¬ 
ever,  are  not  statistically  ilitferent  from  the  ‘  background’ 
dN/dS  in  the  rest  of  each  respective  half  of  the  piiUfA  phylog- 
eny.  In  addition,  the  high  number  of  synonymous  changes 
along  the  branches  leading  to  the  three  novel  dades  suggests 
that  they  did  not  diverge  recently  (see  Results  above),  and  thus 
their  low  dN/dS  ratios  suggest  that  their  encodeil  proteins  are 
e.xpressed  and  functional. 

The  conservation  of  functionally  diagnostic  amino  acid  res¬ 
idues  pr()vides  further  evidence  for  retained  function  in  the 
novel  dades  and  for  their  substrate  spcciHcity  for  methane 
rather  than  ammonia.  The  novel  sequences  a)ntain  a  very  high 
percentage  of  those  amino  acid  residues  conserved  in  both 
methane  and  ammonia  momK)xygenases  (42.  52).  These  con¬ 
served  residues  include  those  proposed  to  bind  metal  ions 
within  the  active  site  and  at  secondary  stabilization  sites  (42, 
52).  as  well  as  a  majority  of  the  previously  identihed  PmoA- 
specitic  resi{liies  (25).  Among  the  mismatched  residues,  almost 
all  are  in  the  same  amino  acid  similarity  groups  as  the  PmoA- 
specihe  residues.  The  i\o\'c\  A fphapioteohacieria  clade  JRl  has 
the  lowest  number  of  perfect  matches  to  putatively  PmoA- 
specific  residues  (11  of  16)  (Table  2)  and  has  several  putatively 
AmoA -diagnostic  residues.  However,  two  of  these  “AmoA- 
like*  residues  are.  in  fact,  shared  by  several  other  PmoA 
cladcs.  Furthermore,  JRl  robustly  clusters  in  the  Alphapro- 
tcohacteria,  within  which  there  are  no  known  -containing 
members.  Thus,  the  toml  evidence  suggests  th,it  JRl  likely 
binds  methane  rather  than  ammonia.  The  novel  (iainmapro- 
feobacieriu  dades  JR2  and  JR3  did  not  contain  any  AmoA- 
diagnoslic  residues.  However,  this  picture  is  complicated 
somewhat  by  the  fact  that  the  only  known  ammonia-o.xidizing 
b,acteria  within  the  class  Gammaproteohacteria,  the  N.  oceani- 
like  clade.  also  lack  many  of  the  AmoA-diugnostic  residues, 
and  they  are  the  closest  phylogenetic  relatives  of  JR2  and  JR3 
(Fig.  1).  However,  based  on  protein  and  inferred-translation 
alignments,  there  appear  to  be  six  sites  that  distinguish  the  N. 
riectf/u’-like  AmoA  from  the  Gammapmteuhacteria  PmoA  (Ta¬ 
ble  2)  and  from  the  enzyme  encoded  by  JR2  and  JR3.  At 
position  71.  the  A.  r>ccr//i/-like  clade  contains  an  AmoA-diag- 
nostic  residue  present  in  no  known  PmoAs.  This  residue  is  not 
present  in  JR2  or  JR3.  At  five  other  sites,  the  /V.  oceaniAikt 
clade  contains  conserved  residues  distinct  from  known  PmoAs 
and  AmoAs;  nvo  residues  are  at  Pmc>A-/AmoA -diagnostic  pK^- 
sition.s.  and  three  others  are  at  positions  conserved  in  all  other 
PmoAs  and  AmoAs  examined  (Table  2).  strongly  suggesting 
functional  relevance.  None  of  these  residues  is  present  in  JR2 
or  JR3.  Filially.  Iiydropliobieity  plots  of  the  consensus  protein 
sequence  of  JR2  and  JR3  show  four  transmembranc  domains 


at  positions  identical  to  those  of  the  Gamfnapn>teuhaaeria 
PmoA  consensus;  in  contrast,  the  fourth  domain  of  the  con¬ 
sensus  for  /V.  rxra/f/-likc  AmoA  is  shifted  12  residues  towards 
the  C  terminus,  e.xactly  matching  the  position  of  the  eorre- 
spc)nding  hydrophobic  domain  of  AmoA  found  within  the  class 
Hetapnjreohactetia  (data  not  shown).  Together,  these  sequence 
anal)’ses  suggest  strongly  that  JR2  and  JR3  are  more  likely  to 
preferentially  bind  methane  than  ammonia. 

Kespon.se  to  simulated  global  change.  It  has  been  suggested 
that  feedback  between  methane  Ilux  and  climate  change  may 
be  due  to  changes  in  the  structure  of  the  methanotroph  com¬ 
munity  (31.51):  however,  it  is  unknown  whether  realistic  global 
changes  have  the  potential  to  alter  the  community  structure  of 
methanotrophs.  We  usedT-RFl  P  analyses  of /?ma4  to  provide 
a  molecular  prolile  of  the  methanotroph  community  and  to  de¬ 
termine  if  shifts  in  community  stiucture  cKcun  ed  in  response 
to  simulated  global  change.  We  observed  shifts  in  the  relative 
abundance  of  both  type  II  methanotrophs  and  the  novel  me¬ 
thanotroph  clade  JR2. 

Type  II  methanotrophs  decreased  in  relative  abundance  in 
respc)iise  to  increased  precipitation  (under  ambient  tempera¬ 
ture)  (Fig.  4,  compare  the  open  bars).  Previous  studies  have 
rcpcirted  decreased  methane  oxidation  rates  under  increased 
soil  moisture  (1.  11.  54),  possibly  due  to  limitations  on  the 
diffusive  transport  of  methane  through  the  soil  gas  phase  when 
soil  moisture  is  high  (31.  46).  It  is  reasonable  that  reduced 
o.xidatioii  rates  could  result  in  the  reduced  relative  abuiidanee 
that  W'c  observed  here,  although  this  was  not  directly  tested  in 
our  study.  We  also  observed  a  significant  decrease  in  the  rel¬ 
ative  abundance  of  type  II  methan(»trophs  in  rcsp<msc  to  in¬ 
creased  temperature  (under  ambient  precipitation)  (Fig.  4. 
compare  the  open  and  hatched  bars  on  the  left).  Although  the 
diffusion  of  methane  can  be  altered  by  temperature  (31).  and 
rates  of  methane  oxidation  are  known  to  vary  with  tempera¬ 
ture.  the  effect  we  observed  is  unlikely  to  be  caused  by  the 
direct  effects  of  temperature  on  methane  supply  or  oxidation. 
The  change  in  soil  temperature  in  our  plots  due  to  the  tem¬ 
perature  treatment  is  negligible.  However,  the  temperature 
treatment  in  our  experiment  has  been  reported  to  signihcantly 
increase  soil  moisture  at  the  time  of  year  at  which  we  sampled, 
due  to  effects  on  the  plant  community  that  alter  water  loss 
from  plant  transpiration  in  the  spring  (60).  It  Is  thus  plausible 
that  the  decrease  in  relative  abundance  we  of>served  with  in¬ 
creased  temperature  is  due  ultim,atcly  to  the  same  mechanism 
as  the  decrease  we  observed  with  increased  precipitation;  an 
increase  in  soil  water  content.  Indeeil,  siiil  water  content  was 
signihcantly  correlated  with  the  relative  abundance  of  type  11 
methanotrophs  (P  =  0.0 17S).  while  other  factors  (ammonium, 
nitrate,  plant  biomass,  net  primary  productivity)  were  not. 

In  addition,  we  observed  a  signiheant  interaction  between 
precipitation  and  temperature,  such  that  the  combined  effect 
of  increased  precipitation  and  temperature  on  type  II  methan¬ 
otrophs  was  less  than  that  expected  by  their  individual  effects. 
It  is  unclear  why  this  might  be.  It  is  not  due  to  iioiiaddilive 
effects  of  temperature  and  precipitation  on  soil  water  content: 
the'rc  was  not  a  signiheant  interaction  between  these  two  fac¬ 
tors  in  reg.ard  to  soil  water  content  in  oiir  study  (data  not 
shown).  One  possible  explanation  is  that  the  negative  effects  of 
soil  moisture  on  methane  diffusivity  are  ameliorated  at  higher 
water  contents  by  an  increase  in  the  proportion  of  anoxic 
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micnisiics  in  ihc*  viil.  lending  to  a  net  increuse  in  methanogen- 
esis.  This  amid  result  in  the  combined  ellects  ot  lempeiature 
and  precipitation  being  less  than  that  expected  by  their  indi¬ 
vidual  effects,  if  when  combined  they  r<tise  the  soil  water  con¬ 
tent  to  a  level  where  the  proportion  of  anoxic  microsites  is 
increased.  This  hypothesis  could  be  tested  in  luturc  work  by 
comparing  the  relative  abundance  of  type  11  methanotroplis  at 
our  site  across  years  that  vary  naturally  in  precipitation. 

The  relative  abuiuhince  of  the  novel  methanotroph  clade 
JRJ  alsti  responded  tt> elevated  precipitation  and  temperature, 
although  in  a  manner  oppttsite  of  lhat  of  lyp^  II  methano- 
trophs.  The  relative  abundance  of  JR2  increased  in  respemse  to 
elevated  precipitation  and  temperature  (Fig.  5).  rather  than 
deci cased,  as  observed  for  type  II  methanotrophs.  Why  might 
JK2  have  respimded  s<i  dillerently  from  clavsical  type  II  me- 
t]ianotio{>hs?  If  the  methanotrophs  in  JR2  are  atmospheric 
methane  ** specialists,”  as  suggested  by  their  ass(K*iation  (al¬ 
though  distant)  with  the  ”WB.>FH-A”  clade.  then  they  might 
be  expected  to  out  compete  other  methanotrophs  under  low- 
metiKine  conditions.  Such  conditions  could  be  present  under 
conditions  of  relatively  high  soil  water  content,  such  as  those 
resulting  from  increased  precipitation  or  temperature,  which 
would  reduce  the  diffusion  of  methane  into  the  s<iil.  We  ob 
sened  a  significant  antagonistic  interaction  between  elevated 
precipitation  and  lemperalure.  such  that  Ihe  combined  effect 
on  the  relaiive  abundance  tif  JR2  was  less  than  that  expected 
from  their  individual  elfeets.  Although  il  is  unclear  why  Ihis 
might  be.  it  is  possible  that  it  could  be  due  to  same  mechanism 
suggested  for  type  II  melhanoirophs:  simullaneous  increases  in 
temperature  and  precipitation  increase  soil  moisture  such  that 
metbanogenesis  increases,  increasing  the  methane  supply  and 
1  educing  the  competitive  advantage  of  .IR2.  Again,  this  is  a 
testable  hypothesis. 

Our  «.>bserviitions  of  significant  interactions  among  global 
change  factors  arc  consistent  with  previous  studies  of  global 
change.  For  e.xample.  Shaw  and  colleagues  observed  lhat  an¬ 
tagonistic  interactions  among  global  changes  could  alter  plant 
biomass  at  our  site  (4.S).  Furthermore,  Horz  et  al.  (26)  ob¬ 
served  that  both  the  abundance  and  community  structure  of 
ammonia-irxidizing  bacteria  at  our  site  wore  altered  by  antag¬ 
onistic  interactions  among  global  change  factors,  including  in¬ 
teractions  between  temperature  and  precipitation. 

Final  conclusions.  Our  study  expands  our  understanding  of 
the  diversity  of  naturally  occurring  pmo/d  gene  types.  The  high 
number  of  novel  pmoA  clades  we  detected  was  possible  Ixr- 
caiisc  of  our  use  of  combinations  of  different  PC'R  primers  in 
a  nested  and  multiplex  manner.  Since  most  other  p/zio/f -based 
studies  have  relied  on  the  use  of  only  one  primer  set.  it  is 
plausible  that  the  novel  clades  we  t>bserved  are  present  in  other 
environments  as  well  bill  have  been  overkK>ked  due  to  the 
primer  set  used. 

Using  this  appr(.>ach.  we  not  only  discovered  novel  pwoA 
clades,  which  evolutionary  and  sequence  analyses  suggest  are 
functional,  hut  we  also  observed  lhat  at  least  one  such  clade 
responded  to  simulated  multifactorial  global  change  in  a  very 
different  manner  than  classic  type  II  methanotrophs.  To  our 
knowledge,  this  is  the  first  study  that  denionsi rates  that  signif¬ 
icant  changes  in  the  community  structure  of  methanotrophs 
can  occur  in  response  to  multifaclorial  glolval  change.  It  is  not 
yet  known  how  widespread  such  responses  arc.  how  such  re 


sponses  may  vary  through  lime,  or  the  relationship  between 
such  changes  and  ecosystem  function.  Nonetheless,  our  results 
demonstrate  that  methanotrophs  can  be  altered  by  glt>bal 
changes  and  that  multilactorial  e.xperimental  approaches  may 
be  iiecessaiy  to  fully  assess  the  amiplexiry  of  these  respt>nses. 
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Community  Genomics  Among 
Stratified  Microbial  Assemblages 
in  the  Oceans  Interior 

Edward  F.  DeLong/*  Christina  M.  Preston,^  Tracy  Mincer/  Virginia  Rich/  Steven  }.  Hallam/ 
NielS'Ulrik  Frigaard/  Asuncion  Martinez/  Matthew  B,  Sullivan/  Robert  Edwards/ 

Beltran  Rodriguez  Brito/  Sallie  W.  Chisholm/  David  M.  Kar/ 

Microbial  life  predominates  in  the  ocean,  yet  little  is  known  about  its  genomic  variability, 
especially  along  the  depth  continuum.  We  report  here  genomic  analyses  of  planktonic  microbial 
communities  in  the  North  Pacific  Subtropical  Gyre,  from  the  ocean's  surface  to  near-sea  floor 
depths.  Sequence  variation  in  microbial  community  genes  reflected  vertical  zonation  of  taxonomic 
groups,  functional  gene  repertoires,  and  metabolic  potential.  The  distributional  patterns  of 
microbial  genes  suggested  depth-variable  community  trends  in  carbon  and  energy  metabolism, 
attachment  and  motility,  gene  mobility,  and  host-viral  interactions.  Comparative  genomic  analyses 
of  stratified  microbial  communities  have  the  potential  to  provide  significant  insight  into 
higher-order  community  organization  and  dynamics, 


MicTobial  plankton  arc  centrally  involved 
in  tliLxcs  of  energy  iind  rruitter  in  the  sea. 
yet  their  vertical  distribution  and  tiinc- 
tional  vnriahility  in  ihc  ocean's  interior  is  .still  only 
p(X)rly  known.  In  c<xitra.st.  the  venial  zonation  of 
eukaryotic  phytoplankton  and  zooplankton  in  ihc 
ocean’s  Wiitcr  column  hiis  been  well  documented 
for  over  a  ccTituiy  (7).  In  the  photic  zone,  steep 
gradients  of  light  quality  and  intcasity,  taiipcralurc. 
ami  macronutnent  and  trace-metnl  concentmnons 
all  influence  .species  distributioas  in  tlx;  water 
column  (J).  At  greater  dcptiis,  low  taiipcraturc. 
increasing  hy<ln)sfciiic  pressure,  the  dis:ippe:irance 
of  light,  and  dwindling  cncigy  suf^lics  largely 
determine  vertical  siratiTvcation  of  oceanic  bioti. 

For  a  few  prokaryotic  groups,  vertical  distiib- 
ulioas  and  depth- variable  physiological  properties 
arc  hecrxning  known  Genotypic  and  phenotypic 
properties  of  stratified  Prochh/xK’OLXvs  “ecotypes” 
for  example,  are  suggestive  of  depth-variable 
adapUition  to  lig^t  intensity  and  nutricot  availabil¬ 
ity  .T).  In  the  abyss,  the  vertical  zonation  of 
<ieep-sea  pieAiphilic  hiKterta  can  he  expbnnerl  in 
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pm  by  ilvir  obligate  growth  rvxjuirement  for 
eic^'atc‘d  hydrosuitic  pressures  (ri).  In  addilicxi. 
recent  ciiltivation-indcpcndeni  {7  IS)  surveys  have 
shown  vertical  zonation  patterns  among  spe¬ 
cific  groups  of  planktonic  Bacteria,  Archaca. 
and  Eukaiya.  Despite  recent  progress  however, 
a  comprehensive  description  of  the  biological 
properties  and  vertical  distributions  of  plank¬ 
tonic  microbial  species  is  far  from  complete. 

Cultivation-independent  genomic  surveys 
represent  a  potentially  useful  approach  for  char¬ 
acterizing  itatural  micn)bial  as.scmblages  ( /(5,  / 7). 
“Sholgiin”  setjiiencing  and  whole  genome  assem¬ 
bly  frixn  mixed  microbial  assemblages  has  been 
attempted  in  .several  environments,  with  varying 
success  {IS.  19).  In  addition,  fringe  et  al.  (20) 
conipurcd  sliotgun  scc|uences  of  several  disfwraic 
micrt)buil  axsemWages  to  ideniify  community- 
specific  patterns  in  gene  distributions.  Metabolic 
reconstruction  has  also  been  attempted  with  cn- 
vironmenud  genomic  approaches  {21).  Never¬ 
theless,  integral ed  genomic  surveys  of  micmhial 
communities  along  well-defined  envinmmentiil 
gradients  (such  as  the  ocean's  water  column) 
have  not  been  reported. 

To  provide  genomic  perspeciive  on  microbial 
biology  in  the  ocean's  vertical  dimension,  wc 
cloned  large  [^36  kiloba.se  pairs  (kbp)]  DNA 
fragments  from  microbial  communities  at  dififer- 
ent  depths  in  the  North  Pacific  Subtropical  Gyro 


iNPSG)  at  the  opcn-occan  timc-scrics  station 
.M.OII A  |33),  The  vertical  distnbution  of  micro¬ 
bial  genes  from  the  ocean's  airfhc'e  to  ahys,sal 
depths  was  detemtined  by  shotgun  sequencing  of 
fijsiiiid  clofX'  tcniiini.  AppK’ing  identical  aillcction. 
cloning,  and  scqiK* *ncing  strategics  at  seven  depdts 
I  ranging  fiom  10  ni  to  4(X)0  m).  vve  archived 
large-iasert  genomic  libianes  friMn  each  depth- 
stratified  miCTobial  community.  Bidinxtional  DN.A 
sequencing  of  fosmid  cIoiks  (~t0.1)00  sequences 
per  riepth)  and  c*omp;inilive  se<|iienoe  analyses 
werc  used  to  identify  taxa.  genes,  and  metabolic 
pathways  that  characterized  vertically  stratified 
microbial  assemblages  in  ihc  water  column. 

Study  Site  and  Sampling  Strategy 
Our  sampling  site,  Hawaii  Ocean  Timc-scrics 
(HOT)  station  ALOHA  (22‘^d5'  N,  I58^W). 
represents  one  of  the  most  comprehensively 
cliiiractcrizcd  sites  in  the  global  ocean  and  has 
been  a  fixal  point  fi>r  time  series  oriented  iKcan- 
ogiaphic  studies  since  l‘)88  (22).  HOT  inves¬ 
tigators  have  produced  high-quality  spatial  and 
time-series  measurements  of  ihe  defining  physi¬ 
cal,  chemical,  and  biological  oceanographic  pa¬ 
rameters  from  surlacc  w  aters  m  the  scafl<x)r.  These 
detailed  spaiial  and  lemporal  daiasets  present 
unique  opportunities  for  placing  microbial  ge¬ 
nomic  depth  pn)files  into  appropriate  <x:eano- 
graphic  context  (22  24)  and  leverage  these  data 
to  formulaic  nxraningful  ccobgical  hypotheses. 
Sample  depths  were  selected  on  the  basis  of 
well-defined  physical,  chemical,  and  biotic  char¬ 
acteristics.  to  reprcscnl  discrcie  fl)ncs  in  Uic  waicr 
column  (Tables  1  ami  2,  I'ig  1;  llgs.  .SI  and  S2). 
Specifically,  seawater  samples  from  the  upper 
euphotic  zone  (10  m  and  70  m),  the  ha.se  of  the 
chlorophyll  maximum  (130  m),  below  the  base  of 
rile  euphotic  amx:  (2IX)  m),  well  bckiw  tlie  ujipcr 
nxfsopeliigic  (^K)  m\  in  the  core  of  the  dissolved 
axygen  minimum  layer  (770  m).  and  in  the  deep 
abyss.  750  m  above  the  seafloor  (4(XX)  m),  vverc 
cdleTtcd  for  preparing  mieiobial  community  DN.A 
libraries  (Tables  I  and  2.  Fig.  I;  figs.  SI  and  S2). 

The  ilepth  vaniibility  ofgpne  distnbiitions  was 
examined  by  random,  bidirectional  cnd-soqucncing 
of  —  50(X)  fosmkls  from  each  depth,  yielding  -64 
Mbp  of  DNA  sequence  imal  from  ihe  4.5  Gbp 
archive  (Tabic  I ).  This  represents  raw'  sequence 
coverage  of  about  5  (1.8  Mbp  sized)  genome 
equivalents  per  depth.  Because  we  surveyed 
-180  Mbp  of  cloned  DNA  (5000  clones  by 
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-36  kbp  clonc  per  depth).  hoN\c\er.  wcdiaxlly 
Miniplcd  -  l(X)  ^erKutie  equivalents  at  ctieli  depth. 
VVe  did  not  sequence  as  deeply  m  each  sample 
as  .1  leeent  S;irgasst)  .Sea  survey  ( (9).  wheiv  fixMii 
ki  NMMXK)  setjuences  were  ohUitnetl  fnHii 
small  DNA  insert  clones,  frotn  each  of  seven  dif¬ 
ferent  surface-water  siunples.  We  hypothesized, 
however,  th.it  our  eomp;inson  of  mk-Tohuil  a»m- 
numities  collected  along  well-detined  environ- 
meiiUil  gradients  (using  birgc-inscrt  D\.\  eknies). 
would  t’acilitute  detection  of  ecologically  meanmg- 
ful  UiNiinoniie.  fiuKtional.  and  community  tremis. 

Vertical  Profiles  of  Microbial  Taxa 
Vertical  distnhuiions  of  bacterial  gmiips  were 
assessed  hy  amplifying  and  sequencing  small 


subunit  (SSL  )  ribosormil  RN.-\  (rRN.\)  genes 
(Voni  complete  fosmid  library  ptnils  at  each 
depth  (Fig.  2,  fig.  S3).  Bacterial  phylogenetic 
distributions  were  generally  eoiisistent  with 
previous  polymerase  chain  reaction  based 
cultivation-independent  rRNA  surs  eys  of  ma¬ 
rine  picoplankton  (<V.  /,L  25).  In  surface-water 
samples.  rRNA-conunning  fosmids  included 
tlx)Sc  from  PnK'hhnKtKCUs:  yemtcomicio- 
hidlc's:  h  Icxilxivtcnu  ■c'oe;  Ga  m  ni  aproteoba  clcria 
(SARy2.  OM6<).  SAR86  eludes):  Alphaprotco- 
biieteria  (SARI  16,  0.\175  eludes);  and  Delut- 
proteobaeleria  (OM27  ekide)  (Fig.  2).  Bacterial 
groups  from  deeper  waters  ineluded  nKmbers 
0  f  IX'fi'rrilHU  UercM  Plain  toiin  ceun  coc:  .1 1  ’iJo- 
hiu  •U’rialcs\  Gen  munanu  nuuhceav\  Silrasfmui : 


Table  1.  HOT  samples  and  fosmid  libraries.  Sample  site,  22^45'  N.  158  W.  All  seawater  samples 
were  pre-filtered  through  a  1.6*um  glass  fiber  filter,  and  collected  on  a  0.22*Mm  filter.  See  (35)  for 
methods. 


Depth 

Sample 

Volume  filtered 

Total  fosmid 

Total  DNA  (Mbp) 

(m) 

date 

(liters) 

clones 

Archived 

Sequenced 

10 

10/7/02 

40 

12,288 

442 

7.54 

70 

10/7/02 

40 

12,672 

456 

11.03 

130 

10/6/02 

40 

13,536 

487 

6.28 

200 

10/6/02 

40 

19,008 

684 

7.96 

500 

10/6/02 

80 

15,264 

550 

8.86 

770 

12/21/03 

240 

11.520 

415 

11.18 

4,000 

12/21/03 

670 

41,472 

1.493 

11.10 

AUvi  oinoitaf.taeceifc\  and  SAR202.  S.\R1 1.  and 
Agg47  planktonic  baetenal  eludes  (Fig.  2,  fig. 
S2).  Large-insert  DNA  clones  previously  recov¬ 
ered  fnim  tlic  man  lie  cnvironmciit  (9,  10)  also 
pnistde  a  giKxl  inetnc  li»r  taxonomic  assessment 
of  indigenous  microbes.  Accordingly,  a  relatively 
large  proportion  of  our  slnitgun  R>smid  scquciKXis 
nxist  closely  nuitchcd  rRN.A-aintaimr^  Kiete- 
noplankton  aniticiul  clones  previously  recovered 
fmm  the  manne  enviromiK'nt  (fig.  S3). 

Ta.\onomic  bins  of  boetcnul  proteui  homologs 
UxukI  in  raixkmily  sequenced  fosniul  ends  ( Fig.  2; 
lig.  S4)  also  retleeted  disUibuiional  paitenis 
geneially  eoi’isi  stent  wath  previoius  surveys  ui  the 
waier  Lxilumii  (tV.  /S).  Unex|xx‘iedly  large  aonxints 
of  phi^K  DNA  were  recovered  in  clones,  piimcii- 
larly  in  the  phtxie  zone.  Also  uncx(vctcd  was  a 
relilively  high  pnjportKjn  of  Bct;ipn>taiKictenu- 
likc  sequences  recovered  at  130  m.  nxjst  shiuing 
highest  similarity  to  pn>lein  hoimlogs  from 
Hhxkifi.'iux  fiiriix'diKt'ns  As  expeeleil  representa¬ 
tion  of  PixK'hliVxx'iKXits-VkQ  aixl  PekisUxKia-VkQ 
gciioiiiie  scquciK'CN  was  liigli  «i  i1k  i>hotie  /oix\  .At 
greater  ikpths.  higher  pniportions  of  C'hknnfU'xi- 
like  sequences,  periiaps  eofresponding  to  iIk'  eo- 
ix'ciirring  SAR2()2  clasle.  were  obserxetl  (I  ig  2). 
PlniK'i<wnx'etales-\ikc  genomic  DNA  sequences 
wcie  alsi»  higlily  represeiiied  al  gi\:;ilet  dqxlis. 

All  urctxieat  S.Slt  rRNA  eoniaining  U>sii>itK 
wen;  klentiticd  at  each  depth,  quantified  by  mae- 
niarray  hybritlizalion.  and  their  rRN.\.s  scqixMKvd 


Table  2.  HOT  sample  oceanographic  data.  Samples  described  in  Table  1. 
Oceanographic  parameters  were  measured  as  specified  at  (49);  values  shown 
are  those  from  the  same  CTD  casts  as  the  samples,  where  available.  Values  in 
parentheses  are  the  mean  ^  1  SO  of  each  core  parameter  during  the  period 
October  1988  to  December  2004,  with  the  total  number  of  measurements 
collected  for  each  parameter  shown  in  brackets.  The  parameter  abbreviations 
are  Temp.,  Temperature;  Chi  a,  chlorophyll  a;  DOC,  dissolved  organic  carbon; 
N-l-N,  nitrate  plus  nitrite;  DIP,  dissolved  inorganic  phosphate;  and  DIC, 


dissolved  inorganic  carbon.  The  estimated  photon  fluxes  for  upper  water 
column  samples  (assuming  a  surface  irradiance  of  32  mol  quanta  m  ^  d  ^ 
and  a  light  extinction  coefficient  of  00425  m  ')  were:  10  m  20.92  (65% 
of  surface),  70  m  -  1.63  (5%  of  surface),  130  m  -  0.128  (0.4%  of  surface), 
200  m  0.07  (0.02%  of  surface).  The  mean  surface  mixed-layer  during  the 
October  2002  sampling  was  61  m.  Data  are  available  at  (50).  'Biomass 
derived  from  particulate  adenosine  triphosphate  (ATP)  measurements  as¬ 
suming  a  carbonJlTP  ratio  of  250.  ND,  Not  determined. 


Depth 

(m) 

Temp. 

(  C) 

Salinity 

Cht  a 
(ug/kg) 

Biomass* 

(gg/kg) 

DOC 

(p  mol/kg) 

N  N 
(nmol/kg) 

DIP 

(nmol/kg) 

Oxygen 

(pmol/kg) 

otc 

(pmol/kg) 

10 

26.40 

35.08 

0.08 

7.21  t  2.68 

78 

1.0 

41.0 

204.6 

1,967  6 

(24.83  ♦  1.27) 

(35.05  ^  0.21) 

(0.08  *  0.03) 

178) 

(90.6  14.3) 

(2.6  ^  3.7) 

(56.0  ’  33.7) 

(209.3  ‘  4.5) 

(1,972.1  t  16.4) 

(2,104) 

11.611) 

1320) 

1140) 

(126) 

(146) 

1348) 

1107) 

70 

24.93 

35.21 

0.18 

8.51  ±  3.22 

79 

1.3 

16.0 

217.4 

1,981.8 

(23.58  r  1.00) 

(35.17  -  0.16) 

(0.15  0.05) 

186) 

(81.4  •  11.3) 

(14.7  -  60.3) 

(43.1  r  25  1) 

(215  8  *  5.4) 

(1,986.9  ±  15.4) 

11.202) 

11.084) 

1363) 

179) 

(78) 

(104) 

1144) 

[84] 

130 

22.19 

35.31 

0.10 

5.03  t  2.30 

69 

284.8 

66.2 

204.9 

2,026.5 

(21.37  t  0.96) 

(35.20  X  0.10) 

(0.15  t  0.06) 

190) 

(75.2  ?  9.1) 

(282.9  *  270.2) 

(106.0  t  49.7) 

(206  6  ♦  6.2) 

(2,013.4  t  13.4) 

11.139) 

|980) 

(350) 

(86) 

(78) 

(68) 

1173) 

(69) 

200 

18.53 

35.04 

0.02 

1.66  ^  0.24 

63 

1,161.9  ‘  762.5 

274.2  -  109.1 

198.8 

2,047.7 

(18.39  *  1.29) 

(34.96  ^  0.18) 

(0.02  -  0.02) 

12) 

(64.0  -  9.8) 

(7) 

(84) 

(197.6  *  7.1) 

(2,042.8  ^  10.5) 

(662) 

1576) 

197) 

(113) 

(190) 

1125) 

500 

7.25 

34.07 

NO 

0.48  0.23 

47 

28,850 

2,153 

118.0 

2197.3 

(7.22  t  0.44) 

(34.06  ±  0.03) 

1107) 

(47.8  t  6.3) 

(28,460  X  2210) 

(2,051  t  175.7) 

(120.5  t  18.3) 

(2,200.2  i  17.8) 

11,969) 

(1.769) 

(112) 

(326) 

(322) 

1505) 

(134) 

770 

4.78 

34.32 

ND 

0.29  0.16 

39.9 

41,890 

3,070 

32.3 

2323.8 

(4.86  -  0.21) 

(34.32  t  0.04) 

(107) 

(41.5  ±  4.4) 

(40,940  *  500) 

(3,000  *  47.1) 

(27.9  ±4.1) 

(2,324.3  t  6.1) 

(888) 

1773) 

(34) 

1137) 

1135) 

1275) 

134) 

4,000 

1.46 

34  69 

ND 

ND 

37.5 

36,560 

2,558 

147.8 

2325.5 

(1.46  X  0.01) 

(34.69  t  0.00) 

(42.3  ±  4.9) 

(35,970  t  290) 

(2,507  t  19) 

(147.8  ±  1.3) 

(2,329.1  t  4  8) 

1262) 

1245) 

(83) 

(108) 

1104) 

1210) 

128) 
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Fig.  1.  Temperature  versus  salinity  (T-S)  relations  for  the  North  Pacific  Subtropical  Gyre  at  station 
ALOHA  (22“4S'N,  ISS  ’W).  The  blue  circles  indicate  the  positions,  in  T-S  "hydrospace"  of  the  seven 
water  samples  analyzed  in  this  study-  The  data  envelope  shows  the  temperature  and  salinity 
conditions  observed  during  the  period  October  1988  to  December  2004  emphasizing  both  the 
temporal  variability  of  near -surface  waters  and  the  relative  constancy  of  deep  waters. 


(tigs.  S5  and  S6),  The  gcncml  piittcms  of  archacal 
distribution  sve  i)bsLTvcxl  uen:  consistent  with  pre¬ 
vious  tiekl  sun-cyst  15.  J5, 26).  RccoNery  of  “group 
11“  plankuxiic  Hunanhu^ffa  gctKMiilc  DNA  was 
gmiteM  in  the  upper  water  ci>lumn  and  declmeil 
below  the  photic  zone.  This  dLstribirtion  cotrob- 
omles  recent  observations  of  ion-transloeating  pho- 
topnrtcins  (Cidled  proteorhodopsins),  now  known 
to  occur  in  gixxip  U  Eturatvlhwofa  inhabiting  the 
photic  zone  (27).  “Gnxip  III*'  EiiniuvluK'oki  DNA 
was  teeoNered  at  all  depths,  but  at  a  much  lower 
fiequcncy  ( figs.  S5  and  S6).  A  novel  crenarchacal 
group,  closely  related  to  a  puuitively  thermophilic 
Crakitvlmeoiu  {2f<),  was  observed  at  die  greatest 
depths  ( fig.  S6). 

Vertically  Distributed  Genes 
and  Metabolic  Pathways 
Tlic  depths  sainplc'd  were  spcxitically  chosen  to 
capnire  miembial  sequences  at  di.scrctc  hiogco- 
chemieal  zones  in  i he  water  column  encompa.ssii^ 
key  physicochemical  features  (Tables  I  and  2. 
Fig.  I.  figs.  SI  and  .S2).  To  cvaloatc  scqocnces 
from  eiich  depth.  Idsmtd  end  seqrienc’es  were 
compared  against  different  databases  including 
the  Kyoto  tncyclopedia  of  Cienes  ami  Cienomes 
(KEGG)  (29).  National  Center  for  Biotcchnolo©' 
Infonnation  |NCBI)’s  Clusters  of  OrtfKdogvsis 
(inxips  (C(Xj)  (Jft).  imd  SLI  I)  subsystems  ( J/). 
After  eatc'gorizing  sequences  ftum  each  depth  in 
BLAST  searches  (J2)  against  each  databa.se,  we 
idenlilicd  pn)tein  categories  that  wctc  more  or 
less  well  rcprescntc'd  in  one  sample  versus  an¬ 
other,  using  cluster  arndysis  (JJ.  J4)  and  Nxit- 
strap  resampling  methodologies  (JJ). 

Cluster  analyses  of  predicted  protein  sequence 
repmscmuition  idcntilkxl  specific  genes  and  nKUi- 
bolie  traits  that  were  differentially  distributed  in 
the  water  column  (fig.  S7).  In  the  photic  zone  ( 10, 
70.  and  130  m).  the.se  included  a  greater 
representation  in  sequences  a.ssociatcd  with  pho¬ 
tosynthesis;  p(jrphyrin  and  chlorophyll  mettbo- 
lism:  type  III  secretion  syactirs;  and  aminosugars, 
purine,  propom)atc.  aixl  vifcimin  BO  metabolism, 
relative  to  deep-water  samples  (fig.  S7).  Indepen- 
dart  comparisons  with  well-annotated  sobsystems 
in  the  SBFD  database  (3/)  also  showed  similar 
and  overiappmg  trends  (tihle  SI),  including 
greater  rqircscntation  in  photic  zone  sequences 
associated  with  alanine  and  aspartate:  metabolism 
of  aminosugars;  chlorophyll  and  carotenoid 
hiosyntlicsis;  maltose  traixsporU  lactose  degrada¬ 
tion;  and  heaxy  metal  ion  sensors  and  exporter;. 
In  contrast,  samples  from  depths  of  200  m  and 
below  (where  there  is  no  photosynthesis)  were 
enriched  in  difleront  seqoeTx;es,  including  those 
asstxialcd  witli  pnilein  folding;  pr<x;essing  and 
export;  methionine  metaKilism;  glyo.xylate,  dicar- 
boxylalc,  and  methane  metabolism;  thiamine 
metabolism;  and  type  11  secretion  systennss  relative 
to  sortiice- water  siimplc*s  tlig.  S7). 

C(Xi  categories  also  [xovided  insight  into 
different iaily  distributed  protein  fiinction.s  and 
categories.  COCjs  more  highly  represented  in  photic 
zone  includcxl  iron-transport  membrane  receptors. 


dco.\yribop\Tiniidinc  photolyasc,  diaminopimelatc 
decarboxylase,  nx-Ttibrane  guanosiix:  triphosplia- 
nise  (GTPd.se)  w  ith  the  lysyl  endopeptukise  gene 
proiKiet  LepA  and  brane hod-chain  amino  acid 
transport  system  components  I  fig.  S.S).  In  con¬ 
trast.  COGs  with  greater  representation  in 
deep-water  samples  included  transposascs.  sev- 
cTdl  dehydrogeniise  categories,  and  integrases 
(fig.  S8).  Sequences  more  highly  represented  in 
the  deep-water  samples  in  St.KD  sohsystem  (J/) 
comparisons  incloded  those  associated  wTih 
respiratory  dehydrogenases,  polyaniinc  adeno¬ 
sine  triphosphate  (ATP)  binding  cas.sette  (ABC) 
transporters,  polyaminc  metabolism,  and  alkyl - 
phosphonalc  transporters  (table  Si ). 

Hahitiit-cnrichcil  .sequences.  We  estimated 
average  protein  sequence  similarities  between  all 
depth  bins  from  cornu lativc  TBLASTX  high- 
scoring  sequence  piiir  (MSP)  hrtscorcs,  denvetl 
from  BLAST  searches  of  each  depth  against 
every  other  (Fig.  3).  Neighbor-joining  analyses 
of  a  normalized,  distance  matrix  derived  from 
these  cumulative  hitscorcs  joined  plKXic  zaine 
and  deeper  samples  together  in  sepiirate  clusters 
(Fig.  3).  Whc'n  wc  compared  our  HOT  sequence 
datasets  to  previously  reported  Sargasso  Sea 
microbial  scqoencxjs  (/y),  these  datasets  also 
clustered  according  to  their  dc^ith  and  size 
fraction  of  origin  (fig.  S9),  The  clustering 
pattern  in  Fig.  3  is  consistent  with  the  ex¬ 
pectation  that  randomly  sampled  photic  zone 
microbial  sc‘qiiencc*s  will  tend  on  average  lo  be 
more  similar  to  one  another,  tliiin  to  those  from 
the  dccp-sca,  and  vice-versa. 

W'e  also  identified  ihose  sequences  (some  of 
which  have  no  homologs  in  amxitatcd  database's) 


that  track  majo"  depth-variable  cnviixmmental 
features.  Specifically,  scxiuenoc  homologs  found 
only  in  the  photic  zone  iiraqoe  sequences  (from 
10,  70.  and  130  m).  or  deepwater  unique 
sequences  (from  500.  770.  and  4000  m)  were 
identified  (Fig.  3).  To  categorize  potential 
functions  encoded  in  these  photic  zone  unique 
(PZ)  or  deep-water  unique  (DW)  scxiucnce 
bins,  each  was  compared  with  KEGG.  COG. 
and  NCBI  protein  databases  in  separate  analy¬ 
ses  (2^,  30.  S6). 

Some  KEGG  metabolic  pathways  appeared 
more  highly  represented  in  the  PZ  thin  in  DW' 
sequence  bins,  including  those  associated  with 
photosynthesis;  porphyrin  and  ehkvtphyll  metab¬ 
olism;  propanoate.  purine,  and  glycerphospholipid 
mctabolLsm:  bacterial  chemotaxi.s;  flagellar  as.scm- 
Wy:  and  type  111  secretion  systems  (Fig.  4A).  All 
prote4>fhtHkpstn  sequences  (except  one)  were 
captured  in  the  PZ  bin.  W'cll-rcprescnicd  phoiie 
zone  KF.GG  pathway  categories  appeared  to  re- 
flea  potential  pathway  interdependencies.  For 
example  the  PZ  photosynthesis  hin  (3%  of  the 
total  (Fig.  4A)]  amtained  PnKiUmtKiKXUs-X'ke 
and  -Siwee/Kx-ocriis-likc  photosystem  1.  photo- 
system  H,  and  cytochrome  genes.  In  tandem. 
PZ  porphyrin  and  chlorophyll  biosynthesis  se¬ 
quence  bins  ['-3.9*4  of  the  total  (Fig.  4A)1  con¬ 
tained  high  representation  of  eyanoh;icteria-like 
cobalamin  and  chlorophyll  biosynthesis  genes,  as 
well  as  photoheterotroph-like  bacteriochloro- 
phyll  biosynthetic  gciKs,  Other  probable  func¬ 
tional  interdependencies  appear  rcflcdcd  in  the 
corccovciy  of  sequences  a.ssociatcd  wiih  chc- 
mota.\ls  (mostly  methyl-accepiing  chemotaxis 
proteins),  flagellar  biosynthesis  (predominant- 
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Fig.  2.  Taxon  distributions  of  top  HSPs.  The  percent  top  HSPs  that  match 
the  taxon  categories  shown  at  expectation  values  of  <1  x  10  Values 
in  parentheses  indicate  number  of  genomes  in  each  category,  complete 


or  draft  that  were  in  the  database  at  the  time  of  analysis.  The  dots  in  the 
lower  panel  tabulate  the  SSU  rRNAs  detected  in  fosmid  libraries  from 
each  taxonomic  group  at  each  depth  U5)  (figs.  S3  and  S^). 


ly  tlajjcllar  motor  and  hook  protciivcncoding 
genes),  and  type  III  secretory  pathways  (all 
associated  with  flagellar  biosynthesis)  in  PZ 
(Fig  4A). 


DW  sequences  were  enriched  In  several 
KKCiti  categories,  including  glyoxylate  and  dicar- 
bo.\ylatc  metabolism  (with  high  representation 
of  I soe i tra tc  lyase  and  ft>r ma te  dehydrogenase 


like  genes):  proectn  folding  and  processing  (pre¬ 
dominantly  ehapcroitc  and  protease  like  gcfKs): 
type  11  secretory  genes  ( '-40°  b  were  most  sim¬ 
ilar  to  pilin  bli>synthesis  genes);  aminoplnispho- 
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Fig,  3,  Habitat-specific  Photic  zone  unique  soqucnccs 

sequences  in  photic  zone 
versus  deep-water  communi¬ 
ties.  The  dendrogram  shows 
a  cluster  analysis  based  on 
cumulative  bitscores  de- 
from  redprocal  TBLASTX 
compahsons  between  all 
depths.  Only  the  branch¬ 
ing  pattern  resulting  from 
neighbor-joining  analy¬ 
ses  (not  branch-lengths) 
are  shown  in  the  dendro¬ 
gram.  The  \fefin  diagrams 
depict  the  percentage  of  se¬ 
quences  that  were  present 
only  in  PZ  sequences  (n 
12,713)  or  DW  sequences 
in  -  14,132),  as  deter¬ 
mined  in  reciprocal  BLAST 
searches  of  all  sequences  in 
each  depth  versus  every 
other.  The  percentage  out 

ol  the  total  PZ  or  DW  sequence  bins  represented  in  each  subset  is  shown.  See  SOM  for  methods  OS). 


nate,  methionine,  and  sultlir  metabolism;  buta- 
noate  metabolism;  ion-couplcd  transporters; 
and  other  ABC  transporter  variants  (Hig.  4B). 
The  high  representation  in  f)W  sequences  of 
type  11  secretion  system  and  pilin  biosynthesis 
gales,  polysacehariite,  and  antibiotic  synthesis 
suggest  a  polentiiilly  greater  nile  for  surfiice- 
a.ssix:iatc<l  microbial  prcKcsses  in  ilic  tleqKT- 
water  communities.  C'onversely,  enrichment  of 
bacterial  motility  and  chemota.\is  sequences  m 
the  photic  zone  indicates  a  potentially  greater 
importance  for  mobilily  and  response  in  these 
assemblages. 

Similar  differential  patterns  of  scqticncc 
distribution  were  seen  in  CXXi  categories  (Fig. 
4B).  COGs  enriched  in  the  PZ  scqueiKC  bin 
includcxJ  phololyascs,  iron-transport  outer  mem¬ 
brane  proteins.  Na  *  -driven  cffltix  pumps,  ABC‘- 
type  sugar-transport  systems,  hydrolases  and 
acyl  transferases,  and  transaldolases,  fn  deeper 
waters,  transposascs  were  the  most  enriched 
CfXT  category  ('“4.5”«  of  tlic  CGG-categorizcd 
DW).  increasing  steathly  in  representation  w-ith 
depth  from  500  ni  to  tlicir  observed  ma.\imiini 
at  4(MX)  m  (Fig.  4B;  fig.  ,S9).  Transposa.ses 
represented  one  of  the  single-most  overrepre¬ 
sented  COG  categories  in  deep  waler.s.  ac¬ 
counting  for  1 .2%  of  all  fosmitls  sequenced 
from  4000  m  (fig.  S8).  Preliminary  analyses 
of  the  transposasc  variants  and  mate-pair  se¬ 
quences  indicate  that  thc'y  represent  a  wide 
variety  of  different  transp<isasc  families  ami 
originate  from  diverse  microbial  taxa.  In  con¬ 
trast,  other  highly  represented  COG  categories 
appeared  to  reflect  specific  taxon  distribution 
and  abundance's.  For  example,  ihe  enrichment 
of  transakiolascs  at  70  m  (Fig.  4B;  fig.  S9)  w'crc 
mostly  derived  from  abundant  cyaixiphage  DNA 
that  was  recovered  at  that  depth  (see  discus¬ 
sion  below). 


Sargasso  Sea  surface-water  miemhial  se¬ 
quences!/ 5))  shared,  as  expected,  many  more 
homologous  scxjuaiCLN  with  our  photic  /one 
sequences  than  ihose  fmm  ihe  deep  sea  ffig. 
SIO).  There  were  10  times  as  many  PZ  than 
DW'  sequences  shared  in  common  with  Sar¬ 
gasso  Sea  Siimples  5  through  7  {19)  (fig.  SID). 
In  contrast,  PZ-like  scxiticnccs  were  only  three 
times  higher  in  DW  when  compared  with  se¬ 
quences  from  Sargasso  Sea  sample  3  (tig.  SIO). 
The  fact  that  Sargasso  sample  3  was  collected 
during  a  period  of  winter  dcvp-walcT  mixing 
likely  contributes  to  this  higher  representation 
of  DW  -like  homo  logs.  5>argasso  Sea  homologs 
of  our  PZ  sequence  hin  included,  as  expected, 
sequences  associated  with  photosynthesis;  ami¬ 
no  acid  transport;  purine,  pyrimidine  and 
nitrogen  metabolism;  porphyrin  and  chloro¬ 
phyll  metalxilisni;  oxidative  phosphorylation; 
glycolysis,  and  starch  and  sucrose  metabolism 
(fig.  SIO). 

Tentative  taxonomic  a.ssigiiments  of  P/  or 
DW  setjuences  (top  If.SPs  fnim  NC'Bf's  nonre- 
dundant  protein  database)  were  also  tabulated 
(tig.  Sll).  As  expected,  a  high  percentage  of 
Prochhrococcu.f-Wkc  sequences  was  found  In 
PZ  (*~5%  of  tlie  lotal),  and  a  greater  represen¬ 
tation  of  DelUipniteohacteiia-like,  /iclimthtttfena- 
likc  and  Plunciomyceie-]ikc  sequences  were 
recovered  in  DW'.  Unexpectedly,  the  single  most 
highly  reprc’sented  taxon  category  in  PZ  (—21% 
of  all  identified  sequences  in  PZ)  w'as  derived 
from  viral  sequences  that  were  cnptiiretl  in 
fosmid  clone's  (tig.  SI  I ). 

Community  Genomics  and 

Host-Virus  Interactions 

Viiuscs  arc  ubiquitous  and  abundant  compo 

nents  of  marine  plankton,  and  influence  lateral 

gene  transfer,  genetic  diversity,  and  bacterial 


mortality  in  the  water  column  (J*^  40).  The  large 
number  of  viral  DNA  scquaiccs  in  our  diitaset 
was  unexpected  (Fig.  5;  fig.  SI2).  because  we 
expected  plankiomc  viruNes  to  |xiss  through 
our  collection  fillers.  Previous  studies  using  a 
similar  approach  found  only  minimal  eontri- 
bulions  from  viral  sources  ( /9, 40).  The  majority 
of  viral  DNA  we  captured  in  fosmid  clone 
libraries  apparently  originates  from  replicat¬ 
ing  viruses  within  infected  host  cells  (.?.'»). 
Viral  DNA  recovery  was  highest  in  the  photic 
zone,  w  ith  cyanophage-like  sequences  repre¬ 
senting  1  to  Ky’b  of  all  fosmid  sequences  fFig, 
5 ),  and  60  to  80®  o  of  total  virus  sequences  tlicrc. 
Below  2(X)  m.  viral  DNA  made  up  no  more 
than  ()..3%  of  all  sequences  at  each  depth. 
Most  photic  zone  vtral  sequences  shared  highest 
similarity  to  T7-like  and  T4-Iike  cyanophage  of 
the  Podoviridae  and  .Vlyoviridac.  This  is  con¬ 
sistent  with  prev  ions  studies  f  40  •/-’),  suggcsling 
a  wide^^read  distribution  of  these  phage  in  the 
ocean. 

Analyses  of  1 107  fosmid  mate  pairs  pro¬ 
vided  further  insight  into  ihe  ongins  of  the  viml 
sequences.  About  67%  of  the  vimslikc  clones 
were  most  similar  to  cyanophage  on  at  least  one 
end,  and  half  of  these  were  highly  similar  to 
cyanophage  at  both  teiniini.  Many  of  the 
cyanophage  clones  showetl  apparent  synteny 
with  previously  sequenced  cyanophage  ge¬ 
nomes  ( fig.  S 1 2  f.  About  1 1  %  of  the  cyanophage 
paired-ends  contained  a  host-derived  eyano- 
phugc  “signatiiie”  gene  (4J)  on  one  tcniiinus. 
The  frequency  and  genetic -linkage  of  phage- 
cncodcd  fbul  host-derived)  genes  we  observed, 
including  virus-derived  genes  involved  in  pho¬ 
tosynthesis  f  p.sh.i.,  p.\hD.  hfi),  phosphale- 
scavenging  genes  {pholl,  psiS).  a  cobalamin 
biosynthesis  gene  icohS),  and  carbi>n  metabo¬ 
lism  Oran.salififhu.')  supports  their  widespread 
distribution  in  natural  viral  populations  and 
llieir  probable  functional  imiK)rtance  to  cyano¬ 
phage  replication  (43,  44). 

If  w'C  assume  that  the  cyanopliagcs'  DNA  was 
denvied  frorn  Infected  host  cell.s  m  which  phage 
were  replicating,  the  percentage  of  cyanophage- 
infoeted  cells  was  estimated  lo  range  between  1 
and  12%  (35).  An  apparent  cyanophage  infec¬ 
tion  maxima  was  observed  at  70  m,  coinciding 
with  the  peak  virusihost  ratio  fFig.  5).  Although 
these  estimates  are  tentative,  they  are  consistent 
with  previously  reported  ranges  of  phage- 
infected  picoplankton  cells  in  situ  (3fV,  4.5) 

About  0.5%  of  all  sequences  were  likely 
prophage,  as  infened  from  high  stxjucncc  sim- 
ilanty  to  phage-related  mtegrases  and  known 
prophage  genes  (3.5).  Paired-end  analyses  of 
viral  fosmids  indicated  that  —2.5%  may  be 
derived  from  prophage  integrated  into  a  variety 
of  host  ta.xa.  A  few  clones  also  appear  to  be 
derived  from  temperate  siphovinises,  and  a 
number  of  ptitativc  eukaryotic  paired-end  viral 
sequences  shared  highest  sequence  identity  with 
homolpgs  from  herpes  viru.ses,  mimivinises, 
and  algal  viruses. 
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C«(l  dtvtsiofi 

Glycln«.  S0rtf>«,  thrvonlne  metatiallam 
Ptnlotnaoalc  and  CoA  Woayniheali 
FfvctoM  and  mannoM  maUbcXtam 
Fatty  acid  matabollam 
Gatactoaa  malaboliam 
POTM  ion  channoit 
AOC  Iranaporlara,  ABC-2,  othart 
Aminophoaphonata  metabolism 
Butanoata  mataboliam 
Sultur  mataboilsm 

Qiyoxyiata  and  dicartxixylate  malaboliam 
Oiner  transcriptional  regulalors 
Lipopolysaccbsridt  biosynihasis 
Olhsf  lon<ouplad  transporters 
(rtosltol  pbosphats  mataboilsm 
Protein  taport 
Type  II  secretion  sysiem 
Matblonlna  matotxjllam 
Aromina  and  proiins  metabolism 
Oaldabva  phosphorylation 
Thiamine  mataboliam 
A8C  Iranaporlara.  aukaryollc 
Proiem  loidind  tnd  processing 
Other  tranalalion  tactora 
Pyrimidine  metabolism 
Riboaomt 

Ures  cycle  and  smlno  group  metaboRsm 
Hialldina  mataboliam 
AmirKMugara  matsbollem 
HTH  lamlly  transcriptional  ragulslorm 
Androgen  end  estrogen  metabolism 
TGr<beia  signaling  ^hwsy 
fCM-racaptor  Inlarscllon 
Starch  and  sucrose  metabolism 
Replication  complex 
ONA  poiymsrase 
Qtycolyais  /  GHiconsogeneats 
Mlrogan  metabolism 
f  oisle  blotynlhesis 
Pyruvate  ntclebollam 
CfirsiB  cycle  [TCA  cycte) 

Translation  feciors 
Bactenel  chemotasis 
QtyccropbosphoMpld  metabolism 
FisgalUr  assembly 
Bloeynthesit  ot  steroids 
ATP  aynthaeie 
propanoate  metsboMsm 
Purine  melsboNsm 
RNA  potymtrase 
Photoeyntheate 

Porphyrin  and  chlorophyll  metabollam 
Type  Ift  aecretion  ayatam 
2,4  Olchlorobensoaie  degradstlon 
TWo<omponenl  ayelem 
Lysine  bloeynthesit 
N-Oiycan  d^radalion 
Pentose  phosphate  pathway 
Phenylalanine,  lyroelnc,  tryptophan  biosynthesis 
UblqulrKrMi  blosynl  heels 

With  st^dard  deviations  greater  than  0,2  of  obsen/ed  values,  having  at  least 
shown. 


ASC-rype  sugar  iranapoa  pa<l|rtesjmc 
NiM-dfIvsn  muMdrug  amui  pump 
Outer  mamvana  |Vo***ns.  Fa  oanaport 
DMA  or  RNA  hoNcascs  of  SupeHamIty  N 
Oeoiiyribodlpyrlmtcttna  phololysse 
NuctaosMe-dtphoaphtW- sugar  apesaraaes 
Pradictad  parmaaaas 
FOG  TPfl  rapaal 

Pradiciad  hydrolaaaa  or  acyHranatsrasas 

FOO  PAbiTAC  domatn 

Signal  transduction  hiatkltna  Unaaa 

Giydna/O-emino  scld  oxidases 

Pradciad  Iranscrtpllonti  ragiJalors 

Transcriptional  raguMors 

7  Parmaaaas  ol  tha  dnig/mauibotita  Iransponar 

Pradclad  dahydroganaaaa  and  raiaiad  prolams 

Oehyitraganaaaa  (and  ahory-ehsin  dcehol  daHaarta] 

Ctycoavhrar>slarasa 

Oahydroganasaa  (fUvoprotalrw) 

ABCSypa  snbmicrobial  pepUda  transport 
In-dapandam  hydrolsaas,  XKludlng  giyoxyiaaaa 

Irtvotvod  in  call  waN  btogartasit 


Tranaposass  and  inactlvsiad  darhstivas 
Pradviad  hydtomaas  or  acytlrans^arasas 
Methionine  synthase  •  |  cabal  smtn-IndepandarM} 
Sucomals  dehydroganasaftumarata  reduclaaa 
TRAP,  type  uncharactarixad  transpori 


Fig.  4.  Cluster  analyses  of  KEGG  and  COG  annotated  PZ  and  DW 
sequence  bins  versus  depth.  Sequence  homologs  unique  to  or  shared 
within  the  photic  zone  (10,  70,  and  130  m)  and  those  unique  to  or  shared 
in  DW  (500,  770,  and  4000  m)  were  annotated  against  the  KEGG  or  COG 
databases  with  TBLASTX  with  an  expectation  threshold  of  1  x  10 
Yellow  shading  is  proportional  to  the  percentage  of  categorized  sequences 
in  each  category.  Cluster  analyses  of  gene  categories  (left  dendrograms) 
were  performed  with  the  Kendall's  tau  nonparametric  distance  metric, 
and  the  Pearson  correlation  was  used  to  generate  the  top  dendrograms 
relating  the  depth  series  (33,  34).  Dendrograms  were  displayed  by  using 
self-organizing  mapping  with  the  Pearson  correlation  metric  (33,  34).  Green 
lines  in  top  dendrograms  show  PZ  sequences,  blue  Hnes  DW  sequences.  (A) 
KEGG  category  representation  versus  depth,  KEGG  categories  with  a 
standard  deviation  greater  than  0.4  of  obsenred  values,  having  at  least  two 
depths  ^0,6%  of  the  total  KEGG-categorized  genes  at  each  depth,  are 
shown.  For  display  purposes,  categories  >8%  in  more  than  two  depths  are 
not  shown.  (B)  COG  category  representation  versus  depth.  COG  categories 
two  depths  ^0,3®/o  of  the  total  COG -categorized  genes  at  each  depth,  are 


Ecological  Implications  and 
Future  Prospects 

Microbial  community  sampling  along  wcll- 
chamcien/wt  tlq^lh  slr.ila  alkmat  us  lo  ateniity 
significant  dcpth-vaiiablc  tiUKls  m  gaK  content 
and  metabolic  pathway  components  of  oceanic 
microbial  communities.  The  gene  repertoire  of 
suifacc  w-aiers  rcTIcctal  mxix*  of  llic  nxclianisns 
.11x1  iTXHles  of  lightnlnven  pnxesses  aix!  pnmary 
pnxhjctK  ily.  Ensironmcntally  diagnostic  sequences 
in  surface  waters  included  pitxlictcxf  proteins  as- 
sociatcxl  with  cyanophagc»  motility,  chemotii.x- 
is,  plxitosynthesis,  protcnifxxli  5isins.  photrilyascs, 
carotenoHl  biosynthesis,  inwi-trinsport  systems, 
and  host  rcstrictton -modification  systems  The 
importance  of  light  energy  to  these  communi¬ 
ties  as  rencctcd  in  their  gene  content  was  ob¬ 
vious,  .More  subtle  aophysiologtcal  trends  can 
be  .seen  in  imn  transport,  vitamin  synthesis, 
flagella  synthesis  and  secretion,  and  chemotaxis 
gene  disUnbutions.  Thc'se  data  support  hypothe¬ 


ses  abfnit  potential  aiUiptiNe  strategies  of  het- 
crotrophic  bacteria  in  the  photic  zone  that  may 
acti\ely  compete  for  nutrients  by  swimming 
toward  nutnent-nch  piirtictes  amt  algae  (44)  In 
contHLsi  to  surface-water  as.scnihlagcs,  deep¬ 
water  micnibial  communities  appeared  more 
enriched  in  transposiuscs.  piliLs  synthesis,  protein 
c.x|X7rt,  polysaccharkle  aixl  antibiotic  synthesis, 
the  glyoxylaie  cycle,  ami  ure.i  metalxilism  gene 
sequences.  The  observed  c'nriehnx’nt  in  pilus. 
polysaccharide,  and  antibiotic  synthesis  genes 
in  deeper-water  siimples  suggests  ii  potentially 
greater  role  for  a  surface-attached  life  style  in 
deeper-w'ater  microbial  communities  I  mall v, 
the  apparent  enrichment  of  phage  genes  and 
restriction-modification  systems  observed  in 
the  photic  zone  may  indicate  a  greater  role  for 
phage  parasites  in  the  more  fModitctivc  upper 
water  column,  relative  to  deeper  waters. 

At  finer  scales,  sequence  distributions  we 
observed  alst)  reflected  genomic  '‘niicrovari- 


ahility”  along  envmmmenlal  gradients,  as 
evidenced  by  the  partitioning  of  high-  ami  low 
light  PnK'hU»tK'tK\us  eeotyjx*  gates  ttlvscrsctl 
in  ditTerent  regions  of  the  photic  /one  (I  ig. 
Highcr-ordcr  biological  interactions  were  also 
evident,  for  example  in  the  negative  correlation 
of  cy  arxtphagc  versus  Pnychlot-octKcus  host  gene 
sequence  recovay  (Fig.  5).  This  rvlaiKtn  Ix*- 
tween  the  ahumtmee  of  host  amt  cyanophage 
DN A  probaWy  reflects  spc'citlc  mechanisms  of 
cyaixjphagc  replication  in  situ.  These  host -parasite 
sequence  correlations  we  saw  dcmonstr.ilc  the  pev- 
tcntial  tor  observing  community-lcvx:l  intcrspc- 
cies  mlenictions  Ihntiigh  envinvnmental  gemxnu- 
datasets. 

()bv  iously.  the  abundance  of  specific  ta.xa 
will  greatly  influence  the*  gene  distnbutioiis  ob¬ 
served,  as  wc  saw,  tor  example,  in  Ptxx'filotxxtK- 
<71 V  gene  distnhiition  in  the  p^iotic  /one  Cxmc 
sequence  di.stnhuiions  can  rellcci  more  thiin  just 
relative  abundance  of  specific  ta.\a,  hovvx:vcr 
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Fig,  5,  Cyanophage  and  cyanobactena  dis¬ 
tributions  in  mkrobial  community  DMA.  The 
percentage  of  total  sequences  derived  from 
cyanophage,  total  cyanobacteria,  total  ProcNor- 
ococcus  sppt,  higMight  Prodiorxxoccus,  low- 
light  Prochiorococcus  spp.,  or  Syne<}}ococcus 
spp.,  from  each  depth.  Taxa  were  tentatively 
assigned  accordrig  to  the  origin  of  top  HSPs  in 
TBLASTX  searches,  followed  by  subsequent 
manual  inspection  and  curation. 


Sane  dcpth-spccific  gene  distributions  wc  ob¬ 
served  [e.g.,  transposascs  luund  prcdonunantly 
at  greater  deprtis  (Fig.  4B;  tig.  S8)],  appear  to 
originate  from  a  wide  variety  of  gene  families 
and  genomic  sources.  These  gene  distrihutional 
patterns  seem  more  indicative  of  habitat-specific 
genetic  or  physiological  trends  tliat  have  spread 
thniiigh  ditTerent  members  of  the  community. 
Communiiy  gene  disinbutions  and  stoichiome¬ 
tries  are  differentially  propagated  by  vertical  and 
horizontal  genetic  mechanisms,  dynamic  physi¬ 
ological  respvwises,  or  intcrspccics  intcrjctions 
like  competition.  The  overrepresentation  oT 
certain  sequence  types  may  sometimes  rctlcct 
their  horizontal  tranionission  and  propagation 
within  a  given  community.  In  our  ikiUisots,  the 
relative  abundance  of  cyaix>bacicrid-likc  /Av/x-t, 
/ivA/},  and  tnin.siiklolase  genes  were  largely  a 
consequence  of  their  horizontal  transfer  and 
subsequent  amplification  in  the  viru.scs  that 
uerc  cviptured  in  our  samples.  In  contrast,  the 
increase  of  transposascs  from  500  to  4000  m, 
rcgardic.ss  of  community  composition,  rc- 
fleeied  a  different  mtxfe  of  gene  propagation, 
likely  related  lo  ihc  slower  growth,  lower 


Percent  total  sequences  per  depth 
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productivity,  and  lower  effective  population 
sizes  of  deep-sea  microbial  communities.  In 
future  comparative  studies,  similar  deviations 
in  environmental  gene  stoichiometries  might 
he  expected  to  provide  even  further  insight 
into  habitat -specific  modes  and  mechanisms 
of  gene  propagation,  distribuiion,  and  mobil¬ 
ity  {27,  47).  These  '‘gene  ecologies”  could 
readily  be  mapped  dirccily  on  organ ismal 
distributions  and  interactions,  environmental 
variability,  and  taxonomic  distributions. 

TIk'  siudy  of  an- ironmcntal  adaptation  and 
vanahility  is  not  new,  but  our  technical  capa¬ 
bilities  for  identifying  and  tracking  sequences, 
genes,  and  metabolic  pathways  in  microbial 
communities  is.  The  study  of  gene  ecology  and 
its  relation  to  community  iiK'tabolism,  inicr- 
j^iecies  interactions,  and  hahitat-‘^cific  signa¬ 
tures  is  nascent.  More  extensive  sequencing 
efforts  are  certainly  required  to  more  tlxiroughly 
describe  natural  microbial  communities.  Addi¬ 
tionally,  more  corK'crtcd  efforts  to  integrate  these 
new  data  into  studies  of  oceanographic,  bio¬ 
geochemical,  and  environmenial  proces.ses  are 
necessary  (4.V).  As  the  scope  and  scale  of  genome- 


enabled  ecological  studies  matures,  it  should 
bcx'omc  ptissiblc  to  model  microbial  community 
genomic,  temporal,  and  spatial  variability  with 
other  envinuinicnial  fcatuies.  Significani  fuiurc 
aiteniinn  will  no  douhi  tbciis  on  inieipreiing 
the  complex  interplay  between  genes,  orga¬ 
nisms,  coniniunilies  and  the  environment,  as 
well  as  the  properties  revealed  that  regulate 
global  biogcochcmical  cycles.  Future  efforts 
in  this  area  will  advance  our  general  perspective 
on  microbial  ecology  and  evolution  and  elu¬ 
cidate  the  biological  dynamics  that  mediate 
the  llu.\  of  matter  and  energy  in  the  world’s 
oceans. 
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Pairing  and  Phase  Separation  in  a 
Polarized  Fermi  Gas 

Guthrie  B.  Partridge,  Wenhui  Li,  Ramsey  1.  Ka mar.  Yean -an  Liao,  Randall  G.  Hulet* 

We  report  the  observahon  of  pairing  in  a  gas  of  atomic  fermions  with  unequal  numbers  of  two 
components.  Beyond  a  critical  polarization,  the  gas  separates  into  a  phase  that  is  consistent  with  a 
superfluid  paired  core  surrounded  by  a  shell  of  normal  unpaired  fermions.  The  critical  polarization 
diminishes  with  decreasing  attractive  interaction.  For  near-zero  polarization,  we  measured  the 
parameter  [5  -  -0.54  •  0.05,  describing  the  universal  energy  of  a  strongly  interacting  paired  Fermi 
gas,  and  lound  good  agreement  with  recent  theory.  These  results  are  relevant  to  predictions  of 
exotic  new  phases  of  quark  matter  and  of  strongly  magnetized  superconductors. 


Fomiion  pairing  iho  essential  ingredient 
in  the  Bardeen.  CiMipcr.  and  SchrietTer 
(BCS)  thcorv'  of  superconductivity  In 
conventional  .superconductors,  the  chemical 
potentials  of  the  two  spin  states  are  eqiuil. 
There  has  been  great  interest,  however,  in  the 
consequences  of  mismatched  chemical  poten¬ 
tials  that  may  arise  in  several  important  sii- 
nations,  including,  for  example,  magnetized 
superconductors  |/  .?)  anti  ctild  dense  quark 
matter  at  the  core  of  neutron  siars  (4)  A 
clieinical  (H)tential  imbalance  may  tie  pnxlucetl 
hy  several  mechanisms,  including  magneti¬ 
zation  in  the  ca.se  of  superconductors,  mass 
asymmetry',  or  unequiil  numbers.  Pairing  is  qual 
itaiKcly  altered  by  the  Fermi  energy  mismatch, 
and  there  has  been  considerable  speculatitin 
regarding  the  nature  and  relative  stability  of 
various  proposed  c.xotic  phases.  In  the  Fuldc- 
Fcrrel-Larkin-Ovchinnikov  (FFLO)  phase  (J.  J). 
pairs  possess  a  non/cro  ccntcr-of-miiss  momen¬ 
tum  that  breaks  tninslational  inviu  iance,  wIkmcxis 
the  Samia  (/),  or  the  breached  pair  (  S).  phase 
is  speculated  to  have  gapless  c.xcitations  A 
mixed  phase  has  also  been  proposed  {6  K)  in 

OefMUment  of  Physitt  and  Astronomy  and  Rite  Quantum 
Insriule,  Rice  University,  Houston,  TX  77251.  USA. 

*To  whom  correspondence  should  be  addiessed.  E-mail: 
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vshich  regions  of  a  paired  BC  S  supertluid  arc 
sumninded  by  an  unpaired  normal  phase  Little 
is  knovsTi  experimentally,  however,  because  of 
the  difficulty  in  creating  magnetized  super- 
condiiclors.  Initial  evidence  for  an  FLL  <)  phase 
in  a  heavy- fen  11  ion  superconductor  h;is  only 
recently  been  reported  (V.  10).  Opportunities  lor 
cxpcnmcntal  invcsiigaiion  of  exotic  pairing 
suites  have  expanded  dramatically  with  the 
recent  realizatKin  of  the  Bose  I  jnstein  comlen 
sale  (BtCf  BCS  crossover  in  a  two  spin  suite 
mixture  of  ulliacold  atomic  gases.  Recent  ex- 
pen  rnents  have  demonstrated  both  siipertluid- 
ily  (//  /.?)  and  pairing  ii4  /^)  in  atomic  Fcmii 
gases.  We  report  the  observaiion  of  painng  in  a 
polarized  gas  of '’Li  atoms.  Above  an  interaction- 
defX'ndent  critical  polarizalHai,  we  observed  a 
phjise  sepiinitKin  ifuil  is  consistent  wilh  a  uni¬ 
formly  paired  superfluid  core  surrounded  by 
an  unpaired  shell  of  the  excess  spin  slate.  Be¬ 
low  the  critical  polariz^ilion,  the  spatial  size  of 
the  gas  was  in  agreement  with  cxpeelations 
for  a  universal,  sliongly  intenieling  p;iireil  I  emu 
gas. 

Our  methods  for  producing  a  degenerate 
gas  of  fermionic  ‘’Li  atoms  (/A.  19)  and  the 
rcalizaiion  of  the  BEC-BCS  crossover  at  a 
Fcshbach  resonance  {17)  have  been  described 
previously  (,’W).  An  incoherent  spin  mi.xturc 
of  the  F  -  '/2,  nty  -  '/j  (stale  ll ))  and  the  F  -  Vz, 


»»/p  -  (Stale  2  )  sublevels  (when:  F  is  the 
total  spin  quantum  number  aixl  is  its  projee- 
iKin)  is  crealett  hy  nidk)  tfeqnency  (rt)  sweeps, 
where  ihc  relative  number  of  the  nvo  states  can 
lx  coiiliolkxl  by  the  rf  power  [20)  The  spin 
niixlure  is  crealett  al  a  magnetic  field  of  "754  (L 
which  Ls  within  the  broad  Fcshbach  resonance 
located  near  83-1  Ci  [21.  22).  The  spin  niiMure  is 
cvaporarively  cooled  by  reducing  the  depth  of 
tlx  tifitK.'al  trap  that  confiixs  lU  and  tlx  mag- 
rxtic  field  is  rampetl  adialxitieally  to  a  tlesireit 
field  within  the  ciossovcr  region.  States  I  and 
2'  are  sequentially  and  independently  imaged 
in  the  trap  by  abstirption  t2(7)  Anaivsis  of 
dicsc  images  prov  ules  mcasuivment  of  .V^  and 
ptilarization  T  --  (,V,  .V,)'(.V,  -F  .V,).  where 

jV  IS  the  number  of  atoms  in  state  li\  VVe  ex¬ 
press  the  Fcniii  leiiiperaturc,  T^..  in  temis  of 
the  niajorily  spin  state,  stale  It  .  as  - 

hoi  (6A',)'  where  o)  -  2n  (u^-uj'  ^  is  the 
mean  haniionic  frequency  of  tlx  cylindrically 
symmetnc  confining  p<iiential  with  radial  and 
axial  frequencies  and  v^.  rcspeclivcly.  For 
F*  -  0,  we  find  tfuit  ,V,  -  .V,  ~  |()\  giving  I\.  - 
400  nK  tor  our  trap  frequencies.  Because  of 
decreasing  evaporation  ctTiciency  with  increas¬ 
ing  pidarizaiion,  there  is  a  correL'ition  between 
P  and  total  atom  number  (fig.  SI ). 

For  fields  on  tlx  low-ficlil  (BFC)  sule  of 
resonamx.  rejil  IwivKhK  hound  states  exist,  and 
molecules  arc  readily  formed  by  throe-body 
recombinaiion.  For  ihe  case  of  P  0,  a 
molecular  Bosc-Elt«cin  condensate  (MBFC)  ls 
observed  to  fixin  witli  nt»  ilclcxtabk:  tlxniul 
molecules  (/7).  ()n  the  hnsi.s  of  an  estimalcxl 
\SEC  condensate  fracnion  of  >*>0‘*b.  wc  place 
an  upper  limit  on  the  tcinpcTaiure  T  <■  0. 1 7,  at  a 
field  of  754  G  ( /  7),  However,  the  g;is  is  expcvled 
10  be  cooled  fiirther  during  the  adiabatic  ramp  tor 
firirit  fieltls  greater  thiin  754  <i  {!?)  By  using 
similar  experimental  methods,  we  previously 
mca.surLxi  the  onlcT  parameter  of  the  gas  in  the 
BCS  re*ginx  and  linind  gixid  agreement  with  T  - 
0  BCS  theory  (77),  indxaling  thiil  the  gas  was 
well  bckiw  the  cntxal  temperature  for  painng. 

Images  of  states  I)  and  I2)  at  a  field  of 
830  (j  arc  shown  (Fig.  I )  for  relative  numbers 
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Materials  and  methods 
Sampling  and  library  preparation 

Seawater  from  selected  depths  (Table  1,  main  text)  were  collected  at  the 
Hawaii  Ocean  Time  Series  (HOT)  station  ALOHA  (22.45®N,  158®W).  Multiple 
hydrocasts  for  sampling  and  measurement  used  a  Conductivity, 

Temperature,  Depth  (CTD)  rosette  water  sampler  equipped  with  24,  12  I 
polyvinyl  chloride  sample  bottles  aboard  the  R/V  Ka'imikai-o-Kanaloa. 

Sample  depths  were  selected  based  on  physical  (temperature,  pressure), 
chemical  (salinity,  dissolved  oxygen)  and  biotic  (chlorophyll  fluorescence) 
characteristics  in  real  time  from  CTD  data.  Seawater  samples  from  seven 
depths  were  collected  from  multiple  hydrocasts  using  a  CTD  system  equipped 
with  24,  12  I  polyvinyl  chloride  sample  bottles.  Samples  from  10  m  to  500  m 
were  collected  on  October  2002,  and  those  from  depths  of  770  m  and  4000 
m  on  December  2003.  The  seawater  was  pre-filtered  in  line  through  a  47  mm 
Whatman  glass  fiber  GFA  filter  (Millipore,  Bedford,  MA)  before  final  collection 
onto  0.22  m  Sterivex-GV  filter  (Millipore)  using  a  Masterflex  peristaltic  pump 
(Cole  Parmer  Instrument  Company,  Vernon  Hills,  IL).  From  one  to  four 
Sterivex  filters  were  used  at  each  depth,  depending  on  the  volume  of  sample 
filtered  (Table  1,  main  text).  The  glass  fiber  GFA  prefilters  were  replaced 
after  each  20  liters  filtered.  After  seawater  collection,  the  Sterivex  filters 
were  covered  with  0.5  ml  of  lysis  buffer  (50  mM  Tris*Hcl,  pH  8.3,  containing 
40mM  EDTA  and  0.75M  sucrose)  and  frozen  at  -80®C.  Samples  were 
transported  back  to  the  laboratory  on  dry  ice,  and  stored  at  -80®C  until  DNA 
extraction.  From  one  to  four  filters  were  extracted  and  the  total  DNA  pooled 
for  subsequent  fosmid  library  construction  from  each  depth. 

DNA  extraction  and  fosmid  library  construction  was  conducted  as 
previously  described,  with  minor  modifications  (SI).  Briefly,  a  solution  of 
proteinase  K  in  sterile  water  was  added  to  a  final  concentration  of 
0.5  mg-ml“^  into  the  Sterivex  filter  cartridge  (Fisher,  Fairlawn,  NJ),  followed 
by  addition  of  SDS  to  a  final  concentration  of  1%  (Sigma,  St  Louis,  MO).  The 
filter  cartridges  were  sealed  and  incubated  at  55®C  for  20  minutes,  followed 
by  further  incubation  at  70®C  for  5  minutes  to  further  promote  cell  lysis.  The 
lysate  was  remove  from  the  filter  cartridge,  and  nucleic  acids  were  extracted 
twice  with  phenol :chloroform:IAA  (25:24:1,  Sigma)  and  once  with 
chloroform: isoamyl  alcohol  (24: l,  Sigma).  After  concentration  of  the  crude 
nucleic  acids  by  spin  dialysis  using  a  Centricon  100  filter  (Millipore),  the  DNA 
was  further  purified  by  CsCI  buoyant  equilibrium  centrifugation,  as  previously 
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described. 

Environmental  DNA  was  cloned  using  the  CopyContror'^  Fosmid  Library 
Production  Kit  (Epicentre,  Madison,  WI)  following  the  manufacturer's 
protocol.  Briefly,  purified  DNA  was  end  repaired  according  to  the 
manufacturer  instructions,  and  size-fractionated  by  pulsed  field  gel 
electrophoresis  (PFGE)  on  a  CHEF-DR-II  system  (Bio-Rad,  Hercules,  CA) 
using  a  1%  SeaPlaque  GTG  agarose  (Cambrex,  Baltimore,  MD)  under  the 
following  conditions:  12®C,  6  V-cm"^  for  16  hrs  and  20-40  s  pulse  time  in  IX 
TAE  (40  mM  Tris-acetate,  1  mM  EDTA,  pH  8.0)  buffer.  The  gel  was 
subsequently  stained  with  SYBR  gold  (Molecular  Probes,  Eugene,  OR)  and 
viewed  on  a  Dark  Reader  transilluminator  (Clare  Chemical  Research, 
Dolores,  CO).  Gel  regions  containing  genomic  DNA  in  40-50  kbp  regions 
were  excised.  The  end-repaired  and  size-selected  DNA  from  gel  slices  was 
recovered  by  gelase  treatment  and  concentrated  and  washed  three  time 
with  an  equal  volume  of  TE  buffer  on  a  Centricon  100  (Millipore).  DNA  was 
ligated  into  the  CopyContror^  pCClFOS^^  vector,  packaged  in  vitro  using 
MaxPlax""^  Lambda  Packaging  Extracts,  and  transduced  into  E.  coli  EPISOO^*^ 
according  to  the  manufacturer's  instructions  (Epicentre,  Madison,  WI). 


Bacterial  and  archaeal  small  subunit  rRNA  screens 

To  survey  bacterial  small  subunit  rRNA  diversity,  E.cofi  host  chromosomal 
DNA  was  first  removed  from  the  fosmid  library  clone  pools  (12,000-18,000 
individually  grown  and  subsequently  pooled  clones  for  each  depth)  by  two 
rounds  of  CsCI  density  gradient  centrifugation  (S2).  CsCI  purified  clone  pool 
DNA  from  each  library  was  nuclease  treated  using  Plasmid  Safe 
exonuclease^"^  (Epicentre,  Madison,  WI)  following  the  manufacturers 
recommendations.  Aliquots  (250-300  ng)  of  the  E.  co//-free,  pooled  library 
DNA  was  subsequently  used  as  template  in  the  downstream  bacterial  SSL) 
rRNA  gene  amplification.  Reaction  mixtures  for  amplification  of  SSL)  rRNA 
gene  sequences  consisted  of  the  following:  250  ng  template  DNA,  0.2  mM 
dNTPs  each,  0.5  uM  each  forward  primer  27F  (5' 

AGAGTTTGATCMTGGCTCAG)  and  reverse  primer  1492R  (5' 
TACGGYTACCTTGTTACGACTT),  5  U  "Easy  A"  thermostable  proofreading 
polymerase  (Stratagene,  La  Jolla,  CA),  in  a  total  of  50  L  reaction  volume. 
Polyermase  chain  reaction  cycles  were  as  follows:  an  initial  denaturation  step 
of  2  minutes  at  94°C;  30  seconds  at  94®C,  30  seconds  at  55°C,  and  90 
seconds  at  72°C  for  a  total  of  15  amplification  cycles.  Reconditioning  PCR 
was  carried  out  to  reduce  heteroduplex  formation  (S3)  as  follows:  initial 
reaction  products  were  diluted  ten-fold,  and  re-amplified  using  parameters 
identical  to  the  above,  except  that  only  three  thermal  cycles  were  performed. 

Triplicate  PCR  reactions  were  pooled  and  cloned  using  a  TOPO  TA 
cloning  kit  (Invitrogen,  Carlsbad,  CA).  From  each  library  192  clones  were 
picked  and  plasmid  DNA  was  purified  with  an  automated  DNA  purification 
system  (AutoGen,  Holliston,  MA)  using  parameters  recommended  for  high- 
copy  plasmid  DNA.  Clone  inserts  were  sequenced  using  primers  27F  and 
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907R  (5'  CCGTCAATTCMTTTRAGTTT)  with  ABI  PRISM  BigDye  Terminator  v3.1 
cycle  sequencing  kit  (Applied  Biosystems,  Foster  City,  CA).  A  range  of  41-81 
bacterial  SSU  rRNA  gene  sequences  were  sequenced  from  each  library. 
Sequences  were  subsequently  aligned  to  a  database  in  ARB  (version  2.5b) 
(54)  and  assigned  to  the  nearest  taxonomic  affiliation  to  environmental  and 
cultivated  isolates.  In  total  351  bacterial  165  rRNA  gene  sequences  were 
analyzed  and  assigned  to  a  taxonomic  bin.  Sequences  were  analyzed  for 
chimeras  with  the  Bellerophon  server  (55)  using  the  Huber-Hugenholtz 
correction  and  a  300-bp  window  size. 

Archaeal  small  subunit  rRNA-containing  fosmids  were  identified 
directly  by  colony  hybridization  (52).  All  fosmid  clones  from  each  library 
were  arrayed  onto  positively  charged  nylon  membranes  using  a  Genetix 
QPix2Xt  automated  robot  and  processed  according  to  manufacturer's 
recommendations  (Genetix,  Hampshire,  UK).  Hybridization  was  carried  out 
at  60°C  with  PC R-gene rated,  non-isotopically  labeled  archaeal  rRNA- 
targeted  probes  using  AlkPhos  Direct  Labeling  and  ECF  Chemifluorescent 
Detection  kits  (Amersham  Biosciences,  Piscataway,  NJ).  Positive  clones  were 
visualized  on  a  Fuji  FLA-5100  fluorescent  image  analyzer  (Fuji  Life  Science, 
USA),  and  all  hybridization  positive  archaeal  rRNA-containing  fosmids  were 
picked  from  each  library,  and  sequenced  as  described  above,  using  Ar20F  (5' 
TTCCGGTTGATCCYGCCRG)  and  U1390R  (5'  GACGGGCGGTGTGTRC)  PCR 
amplification  primers,  and  )  and  internal  sequencing  primers  U530F  (5' 
GTGCCAGCMGCCGCGG)  and  Ar958R  (5' YCCGGCGTTGAMTCCAATT.  Reagent 
concentrations  were  identical  to  the  bacterial  amplifications  above  using 
forward  and  reverse  primers  Ar20F  (5'  TTCCGGTTGATCCYGCCRG)  and 
U1390R  (5'  GACGGGCGGTGTGTRC)  and  template  concentration  ranged  from 
50100  ng/reaction.  Cycling  parameters  were  as  follows:  an  initial 
denaturation  step  of  2  minutes  at  94°C;  30  seconds  at  94®C,  30  seconds  at 
60°C,  90  seconds  at  72°C,  for  a  total  of  25  amplification  cycles.  PCR 
products  were  purified  using  a  Montage  96-well  vacuum  system  (Mlllipore) 
according  to  the  manufacturers  protocol.  PCR  amplicons  from  archaeal  rRNA 
clones  from  each  library  were  sequenced  directly,  using  Ar20F  (5' 
TTCCGGTTGATCCYGCCRG)  and  U1390R  (5'  GACGGGCGGTGTGTRC)  PCR 
amplification  primers,  and  internal  sequencing  primers  U530F  (5' 
GTGCCAGCMGCCGCGG)  and  Ar958R  (5' YCCGGCGTTGAMTCCAATT)  using 
methods  as  stated  above,  yielding  on  average  1200  bp  unambiguous  DNA 
sequence.  Phylogenetic  trees  were  generated  using  ARB  (version  2.5b)  (54) 
and  PAUP4.0  (Sinauer  Associates,  Sunderland.  MA)  using  a  neighbor-joining 
method  with  1000  bootstrap  replicates. 

Cluster  analysis  of  cumulative  bitscore  comparisons  for 
pairwise  depth  comparisons 

Blast  searches  (TBLASTX)  of  all  sequences  from  one  depth  versus  all  from 
every  other,  were  used  to  estimate  cumulative  protein  sequence  differences 
existing  in  all  possible  depth  comparisons.  The  bitscores  of  the  top  high- 
scoring  pairs  (HSPs)  from  every  single  sequence  from  one  depth  versus 
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another  were  summed,  to  yield  a  cumulative  pairwise  bitscore  value.  The 
pairwise  cumulative  bitscore  values  from  all  possible  pairwise  sequence 
comparisons  of  one  depth  versus  another  were  then  used  to  construct  a 
distance  martrix  as  follows:  Cumulative  pairwise  bitscore  values  were 
normalized  by  dividing  each  by:  a)  the  cumulative  bitscore  value  derived 
from  the  sum  of  bitscore  values  in  self-self  TBLASTX  comparisons;  and  b)  the 
total  number  of  HSPs  in  any  given  comparison.  The  normalized,  cumulative 
bitscore  "similarities"  were  then  each  subtracted  from  one  to  derive  pseudo¬ 
distance  values,  and  construct  a  distance  matrix.  Distance  matrices  were 
analyzed  using  Phylip,  v  3.61  by  neighbor  joining  analysis  (56).  To  compare 
our  datasets  with  recently  reported  shotgun  data  from  the  Sargasso  Sea, 
10,000  sequences  from  each  Sargasso  Sea  sample  bin  were  randomly 
selected  (to  normalize  to  our  target-query  size  of  10,000  sequences),  and 
identical  analyses  were  conducted  (fig.  S8).  Clustering  patterns  were 
consistent  with  the  depth  of  origin  and  filtered  size  fraction  of  each  sample 
(fig.  S7),  with  sub-clustering  differentiating  the  Pacific  from  Atlantic  ocean 
photic  zone  datasets. 

Analysis  of  photic  zone  and  deep-water  unique  sequence  bins 

To  identify  sequences  characteristic  of  either  photic  zone  or  deep  water 
microbial  assemblages,  we  conducted  reciprocal  BLAST  (57)  comparisons 
between  each  individual  photic  zone  dataset  (10  m,  70  m,  and  130  m)  and  a 
pooled  deep-water  dataset  (e.g.,  all  500  m,  770  m,  4000  m  combined).  The 
annotation  tool  in  Pymood  (Allometra,  Davis,  CA)  software  was  used  to  parse 
and  identify  shared  sequence  bins.  All  sequences  unique  to,  or  shared 
between,  any  of  the  photic  zone  samples  to  the  exclusion  of  all  deep-water 
samples  identified  in  TBLASTX  searches  (expectation  cutoff  of  1x10'^)  were 
tabulated  and  pooled  using  the  Pymood  annotation  tool.  These  are  the  photic 
zone  unique  sequences  (PZ  In  Fig.  4)  were  then  analyzed  by  comparison  to 
well  curated  databases  (fig  S7A).  Similarly,  each  individual  deep-water 
sequence  dataset  (500m,  770m,  4000m)  were  reciprocally  compared  to  one 
another,  and  the  pooled  photic  zone  dataset  (all  10  m,  70  m,  130  m 
sequences  combined).  Deep-water  unique  sequences  (DW)  were  identified  in 
TBLASTX  (expectation  cutoff  of  1x10'^),  and  similarly  pooled  for  subsequent 
analyses  (as  in  fig.  S7b).  Likewise,  all  those  "core"  sequences  that  were 
present  in  and  shared  significant  similarity  (e-values  <  1x10'^)  In  all  six  data 
sets  (10  m,  70  m,  130  m,  500  m,  770  m,  4000  m),  were  identified  and 
pooled  as  described  above.  (In  these  analyses,  the  transition  depth  200  m 
between  the  photic  zone  and  deeper  waters  was  not  included). 

Once  identified,  the  PZ,  and  DW,  and  "core"  sequence  bins  were  each 
compared  to  the  KEGG,  COG,  NCBI  non-redundant  protein,  and  Sargasso 
sequence  databases  using  BLASTX,  TBLASTX,  or  BLASTN  (57).  Data  were 
parsed  and  ranked  according  to  top  HSPs  and  the  functional  annotations, 
expectation  values,  and  taxonomic  origins.  Sequences  associated  with 
specific  functional,  COG,  or  taxonomic  categories  at  specified  expectation 
value  thresholds  (Fig.  3,  figs.  S7,  S9-S12)  were  then  plotted  as  a  function  of 
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their  fractional  representation. 


Statistical  analysis  of  protein  category  representation 

Each  of  the  protein  sequence  collections  within  specific  categories  [based  on 
comparison  to  KEGG  pathways,  COG  gene  families,  or  SEED  Subsystems 
(S8)],  were  analyzed  to  identify  protein  categories  statistically  more  likely  to 
be  found  In  any  one  sample,  versus  any  another,  or  between  PZ  and  DW 
sequence  bins.  For  speed  and  reproducibility  we  adopted  a  bootstrap 
sampling  method  (S9).  First,  the  difference  between  median  Instances  of  any 
KEGG  subsystem,  KEGG,  Or  COG  category  In  the  dataset  was  calculated: 
10,000  proteins  were  sampled  from  each  sequence  bln,  and  for  each  pairwise 
comparison  the  difference  In  the  number  of  subsystems,  pathways,  or  gene 
families  was  calculated.  This  was  repeated  20,000  times,  and  the  median 
differences  calculated.  To  Identify  those  median  differences  that  were 
statistically  unlikely  to  have  occurred  by  random  chance,  this  process  was 
repeated  for  each  pairwise  bln,  except  the  10,000  proteins  were  sampled 
from  a  bln  at  random.  Again,  20,000  repeat  calculations  were  performed,  and 
the  data  organized  from  least  difference  to  most  difference.  The  confidence 
intervals  were  provided  by  the  appropriate  percentile  differences,  that  is  for 
99%  confidence  Intervals  the  1%  limit  was  provided  by  the  200th  difference 
and  the  99%  limit  was  provided  by  the  19,801st  difference  from  the  ordered 
list.  If  the  difference  of  medians  was  outside  these  limits,  the  subsystem, 
pathway,  or  protein  family  was  considered  to  have  a  statistically  significantly 
different  distribution  In  one  versus  another  dataset. 

This  method  allowed  for  rapid  calculation  of  the  differences  between 
subsystems,  pathways,  or  protein  families,  and  does  not  require  a  normal 
distribution  of  the  data.  Furthermore,  the  sample  size  and  repeat  size  can  be 
modified  to  approximate  the  size  of  the  datasets  Involved  In  the  analysis 


Viral  sequence  analyses 

Shotgun  sequences  were  ranked  by  the  expectation  values  of  their  top 
scoring  HSPs  from  blast  searches  using  blastx  (S7,  S8).  Less  stringent 
expectation  cutoff  values  (<10"^)  were  initially  used,  due  to  the  significant 
sequence  divergence  of  viral  genes,  to  Identify  potential  virus  sequences. 

To  estimate  the  number  of  phage  genomes  Integrated  Into  cellular 
hosts  (I.e.,  prophage),  the  number  of  sequences  with  top  scoring  HSPs  to 
known  temperate  phages  (e.g.,  lambdoid  siphoviruses),  prophages  and 
phage-related  integrase  genes  were  tabulated.  Additionally,  paired  end 
analyses  using  1,107  viral  fosmid  sequence  mate  pairs  were  conducted  using 
relatively  stringent  blastx  criteria  (e-values  better  than  10'®  for  phage  and 
10“^°  for  cellular  hits).  In  this  analysis,  fosmids  were  interpreted  to  be 
derived  from  prophages  when  one  end  was  similar  to  known  temperate 
phages  and  the  other  end  was  similar  to  a  cellular  gene.  Fosmids  with  both 
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termini  similar  to  known  temperate  phages  were  binned  as  temperate 
phages,  or  with  both  ends  similar  to  herpes  viruses,  binned  as  herpes 
viruses. 

Our  sampling  filter  fractionation  procedures  targeted  cells  and  not  free 
phage.  Nevertheless,  a  large  proportion  of  fosmid  ends  were  derived  from 
lytic  phage  DNA.  The  lytic  virus  DNA  in  fosmid  clone  libraries  has  two 
possible  origins:  intracellular  phage  DNA  recovered  from  infected  cells,  or 
free  phage  particles  that  adhered  to  particulate  material  on  the  collection 
filters.  Available  data  suggest  the  majority  of  cloned  phage  from  photic  zone 
samples  originated  from  infected  cells  :  First,  approximately  the  same 
number  of  cells  were  collected  at  each  depth  (Table  1),  so  enhanced  phage 
recovery  due  to  a  putative  increased  particle  loading  at  different  depths  (and 
therefore  increased  coincident  phage  adhesion),  does  not  explain  our  results 
(Fig  5). 

Second,  ratios  of  free  phage  particles  to  bacterioplankton  average 
about  10:1  in  marine  plankton,  and  are  relatively  constant  with  depth  (S9, 
510).  Hence,  variation  in  free  phage  with  depth  (and  hence  variable  depth 
recoveries),  also  does  not  explain  our  results  well.  Given  the  above 
considerations.  It  appears  likely  that  a  large  proportion  of  recovered  phage  in 
our  libraries  was  derived  from  virus-infected  cells,  and  not  free  phage 
particles  that  adhered  to  particulate  material  on  the  filters. 

The  percentage  of  cyanophage-infected  cells  in  our  samples  was 
approximated  as  described  below,  assuming  the  cyanophage  sequences  in 
our  samples  reflected  phage  in  the  process  of  infecting  host  cells.  The 
average  T7/T4  like  cyanophage  genome  is  about  4%  of  that  of  a  typical  (^'-2 
Mbp)  cyanobacterium.  This  translates  to  viral  genome:host  genome  ratios 
ranging  from  0.5:1  to  2.5:1  in  the  photic  zone  libraries.  Since  the  average 
burst  size  is  about  20-80  viruses/cell,  we  can  estimate  from  virus  sequence 
recovery  at  each  depth  that  the  percentage  of  infected  cyanobacteria  in  the 
samples  ranged  from  1  to  12%,  with  the  maximum  occurring  at  70  m  where 
the  virus:host  ratio  was  maximal 


Sequence  characterization  within  and  between  depths 

For  taxonomic  binning  (Fig.  1,  main  text),  BLASTX  was  used  to  compare 
the  set  of  all  predicted  protein  sequences  against  the  NCBI  nonredundant 
protein  database,  using  an  expectation  value  cut-off  of  <10‘®°.  Top  BLAST 
HSPs  in  this  bin  were  tabulated  according  to  the  NCBI  taxonomic  identifier 
for  each  sequence. 

Sequences  were  compared  to  the  KEGG  database  using  BLASTX  {57). 
Blast  results  were  tabulated  and  the  percentage  of  sequences  within  each 
KEGG  pathway  was  calculated  for  each  depth  interval.  Cluster  analysis  and 
"heat  maps"  were  generated  using  Cluster  3.0  (Sii)  using  the  C  Clustering 
Library  version  1.30  {512)  and  Java  TreeView 
(http://jtreeview.sourceforge.net). 

For  COG  assignments  at  each  depth  interval,  open  reading  frames  (orfs) 
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were  identified  using  automated  genome  annotation  software  fgenesb 
(Softberry,  Mount  Kisco,  NY).  Identified  orfs  from  each  sequence  were  then 
compared  to  the  COG  database  using  blastp  (57)  searches  with  an 
expectation  value  cut-off  of  ^  e'^.  Results  were  tabulated,  and  used  to 
determine  the  percentage  of  sequences  contained  in  each  COG  category  at 
each  of  the  seven  depth  intervals.  The  following  threshold  criteria  were  used 
in  determining  which  COGS  were  displayed  groups  in  the  cross-depth 
"heatmaps":  COGs  comprising  >0.2%  of  the  total  COG  counts  at  the  given 
depth  interval,  and  >  3-fold  change  difference  between  at  least  one  other 
depth  interval,  across  all  depths  compared. 
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fig  Sl.Contour  plots  of  upper  water  column  temperature,  chlorophyll  and  dissolved  oxygen  concentrations 
and  rates  of  primary  photoautotrophic  production  at  Station  ALOHA  (22?45'N.  1 58?W)  for  the  period  1 989-2004. 
White  dots  represent  the  positions  of  the  four  upper  ocean  (10,70, 130  and  200  m)  water  samples,  collected  in 
October  2002, that  were  analyzed  in  this  study. 
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fig  S2.  Vertical  profiles  of  relevant  physical  and  chemical  properties  at  Station  ALOHA  (22  deg  45'N,  1 58  deg  W). 
Shown  on  the  left  are  temperature,  salinity  and  water  density  and  shown  on  right  are  dissolved  oxygen  and 
nitrate  +  nitrite,  all  from  the  sea  surface  to  the  sea  bed  at  4750  m.  The  dashed  lines  show  the  positions  of  the 
three  deep  water  samples  (SOO,  770  and  4000  m)  analyzed  In  this  study. 
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fig.  S3.  Bacterial  SSU  rRNA  sequences  recovered  from  each  depth.  Phylogenetic  placement  of  rRNA  sequences 
contained  on  fosmid  clones  recovered  from  each  depth.  Sequences  (600-1 300  bp)  were  aligned  using  ARB, 
and  phylogenetic  placement  estimated  by  parsimony  analyses  (4).  fig.  S3  A,  Phylogenetic  positions  of 
Gammaproteobacteria-like  fosmids  based  on  rRNA  sequence. 
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fig.  S3B,  Phylogenetic  positions  of  Alphaproteobacteria-like  fosmids  based  on  rRNA  sequence. 
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fig. S3C  Phylogenetic  positions  of  fosmids  from  other  bacterial  groups  based  on  rRNA  sequences. 
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fig  S4.  Top  HSPs  matching  phylogenetically  identified  large  genome  fragments  previously  recovered  from 
marine  plankton.  Sequences  from  each  depth  were  searched  against  the  NCBI  non-redundant  protein 
database,  and  top  HSPs  matching  large  DNA  insert  plankton  clones  were  identified.  BLAST  searches 
were  performed  using  blastx.and  an  expectation  cutoff  value  of  le-60. 
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Percent  of  different  archaeal  rRNA-containing  fosmids  at  each  depth 


fig.SS.  Depth  distributions  of  archaeal  SSU  rRNA-containing  fosmids.  The  percentage  of  fosmid  clones  containing  an 
archaeal  SSU  rRNA  gene  from  each  depth. Clones  were  Identified  by  macroarray  colony  hybridization,  and  their  SSU 
rRNAs  KR  amplified  and  sequenced  (fig.S6).  All  fosm id-encoded  archaeal  rRNAs  were  associated  with  one  of 
three  groups:  Group  ICrenarchaea  (n=  37), Group  II  Euryarchaea  (n=  128),  or  Group  III  Euryarchaea  (n=  12). 
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fig  56.  Archaeal  SSU  rRNA  sequences  recovered  from  each  depth.  Phylogenetic  placement  of  archaeal  rRNA 
sequences  contained  on  fosmid  clones  recovered  from  each  depth  Sequences  were  aligned  using  ARB, and 
phylogenetic  placement  estimated  using  neighbor-joining  distance  calculations  with  Jukes-Cantor  correction 
(bootstrap  support  in  percentage  based  on  1000  replications  is  shown  at  nodes  of  >  60%)  (4). 
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fig  S7.  Cluster  analyses  of  KEGG  annotated  gene  sequences  recovered  from  different  depths.  The  percentage 
of  KEGG  annotated  sequences  found  In  each  category. The  yellow  shading  is  proportional  to  the  percentage  of 
identified  sequences  falling  into  each  KEGG  category  at  each  depth.  Dendrograms  were  constructed  as  described 
in  Figure  4,  main  text. The  percentage  of  COG  annotated  sequences  found  in  each  category.  Each  category  shown 
is  represented  in  at  >  0.3  %  of  the  total  KEGG-categorized  genes  at  every  depth.  For  display  purposes,  categories 
that  containing  >  2%  at  >  2depths  are  not  shown. 
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fig  S8.  Cluster  analyses  of  COG  annotated  gene  seqeunces  recovered  from  different  depths.  The  percentage 
of  COG  annotated  sequences  found  In  each  category.  The  yellow  shade  is  proportional  to  the  percentage  of 
identified  sequences  in  each  COG  category  at  each  depth.  Dendrograms  were  constructed  as  described  in 
Figure  4,  main  text.  Each  category  shown  contains  at  least  two  genes  >  0.5  %  of  the  total  KEGG  categorized 
genes  at  each  depth. 
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fig  S9.  Dendrogram  of  a  cumulative  TBLASTX  bitscore  distance  matrix  comparing  Sargasso  Sea  and  North  Pacific 
Subtropical  Gyre  samples.  Cluster  analyses  are  derived  from  pairwise  cumulative  TBLASTX  bitscore  comparisons, 
between  HOT  (this  study)  and  SargassoSea  microbial  community  samples  (13). The  dendrogram  was  generated 
as  described  above  in  Methods,  Supporting  Online  Materials.  Branching  patterns,  but  not  branch  lengths  are  not 
shown.  A  total  of  10,000  sequences  were  randomly  selected  from  each  of  the  seven  Sargasso  Sea  datasets  for 
comparison  to  similarly  sized  HOT  sequence  datasets.  Values  in  parenthesis  represent  the  size  fraction  collected 
In  each  sample,  in  microns. Colors  correspond  to  samples  collected  on  the  same  date. The  Sargasso  Sea  sample 
numbers  correspond  to  sequence  bins  from  individually  collected  seawater  samples  as  previously  reported  (1 3). 
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fig.SIOA. 


fig  SI  OB. 


fig  SI  0.  Comparison  of  North  Pacific  Subtropical  Gyre  photic  zone  (PZ)  and  deep  water  (DW)  sequence  sets  to 
Sargasso  Sea  surface  water  sequences.  fig.SIOA.  Percentage  of  PZ  or  DW  sequences  relative  to  their  respective  totals, 
that  match  Sargasso  Sea  samples  at  an  expectation  value  of  <  1  e-20  in  blastn  searches.  The  Sargasso  Sea  sample 
numbers  correspond  to  sequence  bins  from  individually  collected  seawater  samples,  as  previously  reported  (13) 
fig  SI  OB.  KEGG  category  distributions  of  PZ-related  sequences  in  combined  Sargasso  Sea  sample  bins.  Percentage 
of  the  total  KEGG-categorIzed  sequences  in  each  KEGG  category  present  is  shown.  Analyses  were  performed 
using  BLASTX  as  described  in  supporting  online  material  methods,  with  an  expectation  cutoff  threshold  of  Ixe-S. 


275 


fig.  SI  1 .  DeLong  et  al.,  l\/ls#1 1 20250,  Supplementary  Online  Material 


25.00 

□  PZ 
■  DW 

20.00 

15.00  r 

^  I 


10.00 


fig.  SI  1.  Taxon  categorization  of  top  HSPs  in  photic  zone  and  deepwater  sequence 
bins.  The  percentage  of  total  sequences  from  PZ  or  DW,with  top  HSPs  in  each  of 
the  taxon  categories  shown.  Sequences  were  compared  in  BLASTX  searches 
against  the  NCBI  non-redundant  protein  database,  using  an  expectation  cutoff 
value  1e-8  for  taxon  binning. 
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fig  51 2.  Alignment  of  cyanophage  fos mid- mate  pairs  along  a  fully  sequenced  cya nophage  genome,  fig  51 2 A.  Cyanophage  like  fosmid  end 
sequences  were  compared  to  the  complete  genome  of  cyanophage  PSSM-2  by  blastn  analyses  at  an  expectation  cutoff  of  1x10-3.  All  sequences 
matching  at  similarities  >  80%  identity  are  shown.  Colors  indicate  the  level  of  sequence  similarity.  Lines  split  single  sequences  that  match  the 
genome  at  high  similarity  in  different  regions  across  the  sequence,  fig  SI  2B.  Selected  fosmids  that  matched  PSSM-2  with  high  sequence  similarity 
on  both  fosmid  ends  were  aligr»ed  to  the  whole  phage  genome  sequence.  Colors  indicate  the  level  of  sequervie  similarity. 
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DEPTH 


Subsystem 

10  m 

70  m 

130  m 

200  m 

500  m 

770  m 

4000  m 

1,2-Dichloroethane  degradation 
[PATH:ot00631] 

1 

4 

1 

4 

1 

1 

I 

1-  and  2-Methylnaphthalene  degradation 
[PATH:ot00624]  ! 

1 

3 

2 

2  j 

2,4-Dichlorobenzoate  degradation 
[PATH:ot00623] 

) 

I 

2 

! 

ABC  transporters,  ABC-2  and  other  types 

1 

6 

1 

1 

1  I 

ABC  transporters,  eukaryotic 

1  ! 

3 

1 

I 

I 

ABC  transporters,  prokaryotic 

5  ' 

5 

ATP  synthesis  [PATH  :ot001 93] 

1 

ATPases 

1 

5 

Alanine  and  aspartate  metabolism 
[PATH:ot00252] 

4 

3  ! 

4 

1 

1 

Aminoacyl-tRNA  biosynthesis 
[PATH:ot00970] 

2 

3 

1 

1 

2 

Aminophosphonate  metabolism 
[PATH:ot00440] 

i 

2 

Aminosugars  metabolism  [PATH:ot00530] 

4 

2 

Arginine  and  proline  metabolism 
[PATH:ot00330] 

1 

1 

1 

Ascorbate  and  aldarate  metabolism 
[PATH:ot00053] 

1 

3 

2 

Atrazine  degradation  [PATH:ot00791] 

1  * 

2 

3 

2 

1 

 -j 

Bacterial  chemotaxis 

1 

Basal  transcription  factors  [PATH:ot03022] 

1 

Benzoate  degradation  via  CoA  ligation 
[PATH:ot00632] 

2 

3 

2 

1 

2 

Benzoate  degradation  via  hydroxylation 
[PATH:ot00362] 

2 

1 

2 

Biosynthesis  of  ansamycins  [PATH  :ot01 051] 

I 

i 

2 

Biosynthesis  of  siderophore  group 
nonribosomal  peptides  [PATH  :ot01 053] 

4 

1 

Biosynthesis  of  steroids  [PATH  :ot00 100] 

1 

1 

1  i 

1 

1 

Biotin  metabolism  [PATH:ot00780] 

1 

3 

Bisphenol  A  degradation  [PATH:ot00363] 

2 

1 

4 

Table  S1 .  Fosmid  end  sequeces  with  open  reading  frames  orhtologous  to  genes  in  specific  KEGG  and 
SEED  annotated  pathway  categories,  that  are  statistically  more  represented  at  one  depth  compared 
to  others  in  pairwise  comparisons.  Numbers  represent  the  number  of  parwise  comparisons 
with  other  depths  that  show  overrepresentation  at  the  depth  indicated. 
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DEPTH 


Subsystem 


10  m  70  m  130  m  200  m  500  m  770  m  4000  m 


Butanoate  metabolism  [PATH:ot00650] 

1 

3 

1 

1 

Calcium  signaling  pathway 

1 

; 

Caprolactam  degradation  [PATH:ot00930]  | 

1 

1 

1 

1 

2 

Carbon  fixation  [PATH:ot00710] 

1 

3 

1 

1  ^ 

Cell  division 

1 

1 

1 

1 

Citrate  cycle  (TCA  cycle)  [PATH:ot00020] 

 -  .  J 

 .  I 

5 

Cyanoamino  acid  metabolism 
[PATH:ot00460] 

4 

1 

Cysteine  metabolism  [PATH:ot00272] 

1 

2 

1 

.  3  J 

1 

Cytokines 

3 

- 1 

2 

2 

D-Alanine  metabolism  [PATH:ot00473] 

3 

3 

D-Glutamine  and  D-glutamate  metabolism  ! 
[PATH:ot00471]  | 

1 

1 

Diterpenoid  biosynthesis  [PATH:ot00904] 

5 

i 

Enzyme 

1 

1 

1 

1 

Fatty  acid  biosynthesis  (path  1 ) 
[PATH.otOOOei] 

1 

1 

1 

1 

1 

Fatty  acid  metabolism  [PATH;ot00071] 

4 

Flagellar  assembly  [PATH;ot02040] 

6 

Fluorene  degradation  [PATH:ot00628]  | 

2 

3 

1 

Folate  biosynthesis  [PATH:ot00790] 

5 

4 

1  1 

1  : 

 1 

Fructose  and  mannose  metabolism 
[PATH:ot00051]  , 

3 

3 

3 

1 

3 

Galactose  metabolism  [PATH:ot00052]  j 

1 

2 

 -J 

4 

2 

Ganglioside  biosynthesis  [PATH:ot00604]  | 

3 

i 

....  J 

Globoside  metabolism  [PATH:ot00603] 

2 

-t 

1  i 

1 

Glutamate  metabolism  [PATH:ot00251] 

1 

1 

2 

1 

1 

Glutathione  metabolism  [PATH:ot00480] 

3 

1 

1 

’  J 

1 

1 

Glycerolipid  metabolism  [PATH;ot00561] 

3  i 

, 1 

1 

1 

Glycerophospholipid  metabolism  I 

[PATH:ot00564] 

2 

J 

J 

i 

1  1 

.    „j 

Glycine,  serine  and  threonine  metabolism  ' 
[PATH:ot00260] 

1 

1 

] 

1 

3 

1 

Glycolysis  /  Gluconeogenesis 
[PATHiOtOOOlO] 

1 

1  ! 

J 

1  I 

j 

1 

1 

Glycosphingolipid  metabolism 
[PATH:ot00600] 

1 

! 

3 
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_ DEPTH _ 

10m  70  m  1 30  m  200  m  500  m  770  m  4000  m 


Glyoxylate  and  dicarboxylate  metabolism 
[PATH:ot00630] 

2 

4 

3 

3 

HTH  family  transcriptional  regulators 

2 

i 

4 

2 

Histidine  metabolism  [PATH:otCX)340] 

2 

1 

1 

Huntington's  disease  | 

5  1 

Inositol  metabolism  [PATH :ot00031] 

1 

2 

2 

2 

2 

Lipopolysaccharide  biosynthesis 
[PATH:ot00540] 

4 

5 

1 

Lysine  biosynthesis  [PATH:ot00300] 

I 

1 

-  -  -  J 

.  3  J 

Lysine  degradation  [PATH:ot00310] 

1 

2 

i 

..  ,  i 

2  ' 

2 

2 

Major  facilitator  superfamily  (MFS) 

3  I 

2  I 

' 

Methane  metabolism  [PATH:ot00680]  | 

1 

1 

3  1 

1  1 

3 

4 

Methionine  metabolism  [PATH:ot00271] 

1 

3  i 

1  1 

1 

.  I 

N-Glycan  biosynthesis  [PATH:ot00510] 

3  ! 

..  .  .  1 

2 

I 

I 

Nicotinate  and  nicotinamide  metabolism 
[PATH:ot00760]  ; 

1 

J 

I 

i 

Nitrobenzene  degradation  [PATH:ot00626] 

3  ' 

2 

Nitrogen  metabolism  [PATH :ot009 10] 

J 

J 

Non-enzyme  j 

2 

1 

i 

1 

2  j 

I 

Novobiocin  biosynthesis  [PATH:ot00401] 

J 

3 

i 

2 

1 

L 2  J 

Nucleotide  sugars  metabolism 
[PATH:ot00520] 

2 

2 

2 

1 

One  carbon  pool  by  folate  [PATH:ot00670] 

1  1 

1  i 

1 

5  t 

2  , 

Other  and  unclassified  family  transcriptional 
regulators 

1 

1 

1 

Other  ion -coupled  transporters 

1  I 

4 

4 

1 

Other  replication,  recombination  and  repair 
factors 

4 

4 

4 

I 

other  translation  factors 

2 

2 

1 

1 

1  i 

Other  transporters 

2 

2 

5 

2 

2 

Pantothenate  and  CoA  biosynthesis 
[PATH:ot00770] 

2 

1 

1 

Penicillins  and  cephalosporins  biosynthesis 
[PATH:ot00311] 

6 

Pentose  and  glucuronate  interconversions 
[PATH:ot00040] 

1 

2 

2 

2 

2 

Pentose  phosphate  pathway 
[PATH;ot00030] 

1 

3 
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_ DEPTH _ 

10m  70  m  1 30  m  200  m  500  m  770  m  4000  m 


Peptidoglycan  biosynthesis  [PATH:ot00550] 

4 

2 

1 

1 

Phenylalanine  metabolism  [PATH;ot00360] 

2 

2 

2 

3 

Phenylalanine,  tyrosine  and  tryptophan 
biosynthesis  [PATH:ot00400] 

2 

5 

1 

1 

1 

Phosphatidylinositol  signaling  system 
[PATH:ot04070] 

2 

Phosphotransferase  system  (PTS) 

1 

2 

3 

1 

Photosynthesis  [PATH:ot00195] 

4 

4 

4 

2 

Pores  ion  channels 

I 

I 

1 

1 

2 

Porphyrin  and  chlorophyll  metabolism 
[PATH:ot00860] 

1 

1  ' 

1 

Prion  disease 

1 

, 

1 

1 

1 

Propanoate  metabolism  [PATH;ot00640] 

1 

2 

1 

1  I 

1 

Proteasome  [PATH:ot03050] 

5 

Protein  export  [PATH:ot03060] 

2 

1 

2 

Protein  folding  and  associated  processing 

^ 

3 

Purine  metabolism  [PATH:ot00230] 

4 

vj 

j 

Pyrimidine  metabolism  [PATH:ot00240] 

4 

1  ' 

.  J 

3 

Pyruvate  metabolism  [PATH;ot00620] 

1 

Pyruvate/Oxoglutarate  oxidoreductases 

2 

RNA  polymerase  [PATH:ot03020] 

5 

1  ! 

3 

2  1 

Reductive  carboxylate  cycle  (C02  fixation)  ! 
[PATH:ot00720] 

1 

3  1 

1 

1 

Replication  complex 

1 

4 

1 

Restriction  enzyme 

4 

Riboflavin  metabolism  [PATH:ot00740] 

1 

1  1 

2 

3  i 

3 

1 

Ribosome  [PATH :ot03010]  \ 

5 

1 

3  j 

1 

1 

1 

Selenoamino  acid  metabolism 
{PATH;ot00450]  J 

1 

1 

1 

Sporulation 

2  : 

2 

Starch  and  sucrose  metabolism 
[PATH;ot00500]  ^ 

1 

2 

1 

1  ; 

1 

Stilbene,  coumarine  and  lignin  biosynthesis  : 
[PATH:ot00940]  J 

2  I 

. — 

Streptomycin  biosynthesis  [PATH:ot00521] 

1 

1 

1 

2  ■ 

1 

1  i 

Styrene  degradation  [PATH:ot00643] 

2 

Tetracycline  biosynthesis  [PATH;ot00253] 

1 

3 

j 

-  J 

I 

i 
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_ DEPTH _ 

Subsystem  10m  70m  130m  200  m  500  m  770  m  4000  m 


Thiamine  metabolism  [PATH:ot00730] 

1  1 

1  I 

1 

4  1 

1 

1 

Tight  junction 

5  J 

1 

1 

Translation  factors  | 

1 

2 

5 

1 

1 

3  ; 

1 

Tryptophan  metabolism  [PATH:ot00380]  ^ 

4^ 

4 

J 

J 

1 

[.  J 

-  .  1 

Two-component  system  j 

2 

1 _ 

1 

1 

1 _ ■ 

- r 

2 

Type  II  secretion  system  [PATH:ot03090] 

1  1 

i 

1  1 

1  ' 

1  i 

1 

Type  III  secretion  system  [PATH:ot03070] 

2  : 

! 

_ j 

Tyrosine  metabolism  [PATH:ot00350] 

1  i 

4 

1 

2 

2  j 

1 

Ubiquinone  biosynthesis  [PATH:ot00130] 

1  j 

2 

L   J 

2 

Urea  cycle  and  metabolism  of  amino  groups 
[PATH:ot00220] 

1 

1  I 

 J 

6 

1 

- ! 

1 

I 

- 1 

1 

1  j 

Valine,  leucine  and  isoleucine  biosynthesis 
[PATH:ot00290] 

1  ! 

5 

1 

i 

1  I 

J 

Valine,  leucine  and  isoleucine  degradation 
[PATH:ot00280] 

3 

1 

5 

1 

1 

1 

Vitamin  B6  metabolism  [PATH:ot00750] 

2 

beta-Alanine  metabolism  [PATH:ot00410]  | 

1  i 

6 

1 

i 

I 

gamma- Hexachiorocyciohexane  degradation! 
[PATH:ot003611  1 

3 

I 
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